Splitting Large CSV Files into Chunks Using the `split` Command in Linux
When working with large CSV files, such as during data migrations to a PostgreSQL database, processing the entire file at once can be time-consuming and resource-intensive. Splitting the file into smaller chunks allows for parallel processing and easier management of the data.
Using the `split` Command
The `split` command is a powerful tool for dividing large files into smaller, more manageable pieces.
Example Command
```bash
split -l 5000 "sheet.csv" split --additional-suffix=.csv
```
This command splits the `sheet.csv` file into multiple smaller files, each containing 5,000 lines. The output files are named sequentially as `splitaa.csv`, `splitab.csv`, `splitac.csv`, and so on.
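One caveat worth flagging: `split` is purely line-oriented and knows nothing about CSV structure, so only the first chunk (`splitaa.csv`) will contain the header row. Below is a minimal sketch that keeps the header in every chunk, assuming GNU `split` and a single-line header in `sheet.csv`:

```bash
# Save the header row, split the remainder, then prepend the
# header to each chunk. Filenames match the example above.
header=$(head -n 1 sheet.csv)
tail -n +2 sheet.csv | split -l 5000 --additional-suffix=.csv - split
for f in split*.csv; do
  printf '%s\n' "$header" | cat - "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```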
Explanation of Options:
- `split`: The command used to split files.
- `-l 5000`: Splits the file into chunks of 5,000 lines each (`-l` stands for lines).
- `"sheet.csv"`: The input file to be split.
- `split`: The prefix for the output files.
- `--additional-suffix=.csv`: Appends the `.csv` extension to each output file.
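As a side note, GNU `split` also supports numeric suffixes via the `-d` flag, producing `split00.csv`, `split01.csv`, and so on, which can be easier to sort and script against than the default alphabetic names shown below:

```bash
split -l 5000 -d "sheet.csv" split --additional-suffix=.csv
```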
Output Example:
```
splitaa.csv
splitab.csv
splitac.csv
...
```
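A quick sanity check after splitting: every chunk except possibly the last should report 5,000 lines.

```bash
# Count lines per chunk; only the final chunk may be shorter.
wc -l split*.csv
```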
Why Use `split` for Large Files?
- Efficient Processing: Work on smaller chunks individually or in parallel.
- Faster Database Imports: Load smaller pieces into PostgreSQL in separate, manageable transactions (see the sketch after this list).
- Easy File Management: Manage and track parts of large datasets conveniently.
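As a sketch of the import step, assuming a database named `mydb` and a target table `my_table` whose columns match the CSV (both names are illustrative), and chunks that each carry the header row as in the snippet above, each `psql` call below runs in its own transaction:

```bash
# Import each chunk separately; a failure only rolls back that chunk.
for f in split*.csv; do
  psql -d mydb -c "\copy my_table FROM '$f' WITH (FORMAT csv, HEADER true)"
done
```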
Practical Use Case
While migrating a large dataset from a CSV file to a PostgreSQL database, I encountered performance issues due to the file's size. By using `split`, I divided the data into smaller parts and imported them one by one, significantly reducing processing time.
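If importing one by one is still too slow, the same chunks can be loaded in parallel. Here is a sketch using GNU `xargs`, reusing the illustrative `mydb`/`my_table` names from above (the `-P 4` degree of parallelism is an assumption worth tuning):

```bash
# Run up to four imports concurrently; PostgreSQL handles
# concurrent inserts into the same table.
printf '%s\n' split*.csv |
  xargs -P 4 -I {} psql -d mydb -c "\copy my_table FROM '{}' WITH (FORMAT csv, HEADER true)"
```

Gains from higher parallelism taper off once disk and WAL writes saturate, so the degree of concurrency is worth measuring rather than guessing.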