Writing a CSV splitter in Python

How I created a tool to split CSV files with less than 50 lines of code

The Issue To Solve

Lately in my coding journey I’ve been writing tools that help solve small productivity issues. This week I was talking with someone who had an issue related to splitting up data into multiple spreadsheets, so I decided to write a tool to help them complete that task.

The request they had was a tool that could take some spreadsheet data that contained around 100,000 rows and split it into individual spreadsheets that contained about 10,000 rows each, while making sure each new spreadsheet file retained the proper column names listed in the first row.

To solve this issue I again decided to use Python because it ships with so many useful modules. The module I used in this case is of course called csv. The Python docs explain its functionality quite well:

The csv module implements classes to read and write tabular data in CSV format. It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel.

This was perfect for me as I was dealing with CSV files that were exported directly from Excel, and would need to be imported back to an Excel workbook. In my research I did discover a Python Library called pandas that can import and export excel files natively, but I wanted to focus on creating something that could be used with a default Python 3 installation.

Creating The CSV Splitter

Below I’ll outline the steps I took to create the tool, explaining each function and then showing the final script.

Main()

The main function does a few things. First it takes in input from the user to grab the location of the original CSV file and then asks how many rows each split CSV file should contain.

Next it opens the CSV file and assigns it to csv_file. A DictReader object is then created from this csv file, and the column headers are copied over using the fieldnames parameter of the DictReader object.

This info is then passed on to the loop_through_csv() function, which in turn calls the write_new_csv() function as explained below.

write_new_csv()

The write_new_csv() function takes in a csv file that is currently open and being written, the names of the column headers in the original csv (stored as an array of strings), and the split_number which assigns a number to each CSV file as they are split and saved. The names would end up being for example: CSV_1, CSV_2, CSV_3, etc.

The function then takes the open CSV file and creates a DictWriter object, assigning the column_headers array to the fieldnames parameter. With that in place, the header row is written using the writeheader() function.

I return the DictWriter object so I can then loop through each row and continue writing to the same CSV file using writerow() until it’s time to close the file and create a new CSV file again with an incremented filename. As you saw in main(), the original CSV file is being read as a dictionary, so writing

loop_through_csv()

This is where the important logic of the script lives. It’s a bit longer of a function so I’ll show it here and explain below.

First I set up a typical counter variable for each row, a split_number which again assigns a number to each new CSV file being written, and the script starts out with write_csv set to None because nothing is set to be written until a few lines later when some checks are done on the CSV.

Then we enter the for loop, which will iterate over every single row in the original CSV file. First it checks if row_count is set to 0 or the split threshold has been met, and in either case it will open a new CSV file to be written to the disk. When the script hits this path we can assume it’s either at the very beginning of the file or it’s writing a new CSV file, so there’s some logic to check if the previous CSV file needs to be closed so it doesn’t take up extra memory.

The row_counter is also set back to 0 at the end of this code path, because no matter what at the end of the loop the row_counter will be incremented by 1. This logic ensures that the very first row of the original CSV file, which contains the column headers, is only ever referred to once, and then for the rest of the file the row_counter increases until the split_threshold is met, in which case row_counter gets set back to 0, and 1 again before repeating the same logic for the next CSV file being written.

The writer.writerow(row) function does the main work of writing a new row for each iteration of the for loop. The reason I chose to use the dictionary format for reading and writing fields is because once the proper column headers have been set for each CSV file being written, I can simply pass each row to the writerow() function, and their formats will match perfectly. Here’s an example of how the formatting works:

Full Script

I tested the tool out on many different CSV files and it’s working great. For the next iteration I might add a basic GUI so users can browse for CSV files or try to create a binary file so it can be run on systems that don’t have Python 3 installed.

Software Developer and Video/Post Production Professional. Recent graduate of Flatiron School.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store