Split Excel file into multiple files using Python

Question

I have an Excel spreadsheet with almost 40,000 rows of entries. I would like to split this Excel file into multiple files based upon the values in Column C starting with Row 6. I've been able to determine how to split the file, but the challenge I seem to be having is getting the header rows to carry over. This is a specific template for the ArchivesSpace Application and for whatever reason the information in Rows 1-5 must be present. I've tried deleting this information and using only the field codes and that was not successful. Here is the code I've tried:

import pandas as pd
import os
import openpyxl

df = pd.read_excel('container_list_master.xlsx')
column_name = 'ead'
unique_values = df[column_name].unique()

for unique_value in unique_values:
    df_output = df[df[column_name].str.contains(unique_value)]
    output_path = os.path.join('output', unique_value + '.xlsx')
    df_output.to_excel(output_path, sheet_name=unique_value, index=False)

I'd suggest working directly with openpyxl in such cases.

– Charlie Clark, Commented Feb 9, 2024 at 9:34
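
For reference, a minimal sketch of what working directly with openpyxl might look like. It assumes the template has five header rows, that the split key is in column C, and that every key is usable as a file name. The function name split_workbook and its parameters are illustrative, not part of openpyxl. Only cell values are copied; formatting and merged cells are not.

```python
from collections import defaultdict
from pathlib import Path

from openpyxl import Workbook, load_workbook


def split_workbook(source, out_dir, key_col=3, header_rows=5):
    """Write one workbook per distinct value found in column `key_col`.

    The first `header_rows` rows of the source sheet are copied (values only)
    to the top of every output file. Returns the list of paths written.
    """
    ws = load_workbook(source).active
    rows = list(ws.iter_rows(values_only=True))
    header, data = rows[:header_rows], rows[header_rows:]

    # Group the data rows by the value in the key column (1-based index).
    groups = defaultdict(list)
    for row in data:
        key = row[key_col - 1]
        if key is not None:  # ignore rows with an empty key cell
            groups[key].append(row)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for key, group in groups.items():
        wb = Workbook()
        out = wb.active
        for row in header + group:
            out.append(row)  # header rows first, then this key's rows
        path = out_dir / f"{key}.xlsx"  # assumes the key is a valid file name
        wb.save(path)
        written.append(path)
    return written
```

Called as split_workbook('container_list_master.xlsx', 'output'), this would produce one file per distinct value in column C, each starting with the same five rows as the master file.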

BitsAreNumbersToo · Accepted Answer · 2024-02-09 07:20:57Z

With some minor alterations, your code can be adapted to produce your desired output.

The main points you were missing:

Reading with skiprows: df = pd.read_excel('container_list_master.xlsx', skiprows=range(4)). The skiprows parameter does exactly what it sounds like: it skips the first four non-table rows while reading the file, so row 5 is used as the column header.
Writing with startrow: df_output.to_excel(output_path, sheet_name=unique_value, startrow=4, index=False). This is the same concept as skiprows, except for writing: the table starts on row 5 and the four rows above it are left empty.
Creating a writer object, then using that writer object to fill in the header cells as suggested in this answer, which is demonstrated below.

import pandas as pd
import os
# import openpyxl

df = pd.read_excel('container_list_master.xlsx', skiprows=range(4))
column_name = 'EAD ID'
unique_values = df[column_name].unique()

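for unique_value in unique_values:
    # Exact match; str.contains would treat the value as a regex and also match substrings
    df_output = df[df[column_name] == unique_value]
    output_path = os.path.join('output', unique_value + '.xlsx')
    # The .cell() calls below are openpyxl API, so request that engine explicitly
    writer = pd.ExcelWriter(output_path, engine='openpyxl')
    df_output.to_excel(writer, sheet_name=unique_value, startrow=4, index=False)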
    
    writer.sheets[unique_value].cell(1, 1, 'This is the template for importing ...')
    writer.sheets[unique_value].cell(2, 1, 'Mapping - ArchivesSpace ... SECTION')
    writer.sheets[unique_value].cell(3, 1, 'Mapping - ArchivesSpace ... FIELD')
    writer.sheets[unique_value].cell(4, 1, 'ArchivesSpace field code ...')
    writer.sheets[unique_value].cell(2, 2, 'Resource ...')
    writer.sheets[unique_value].cell(3, 2, 'The resource ...')
    writer.sheets[unique_value].cell(4, 2, 'Collection id ...')
    writer.sheets[unique_value].cell(2, 3, 'Resource ...')
    writer.sheets[unique_value].cell(3, 3, 'EAD ID ...')
    writer.sheets[unique_value].cell(4, 3, 'ead ...')
    writer.sheets[unique_value].cell(4, 4, 'res_uri')
    writer.sheets[unique_value].cell(4, 5, 'ref_id')
    writer.sheets[unique_value].cell(4, 6, 'title')
    writer.sheets[unique_value].cell(4, 7, 'something unique')
    writer.close()
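
The header cells do not have to be typed in by hand. Here is a sketch, assuming the first four rows of the master file hold the template text and row 5 holds the column names: read those four rows separately with header=None and write them above each chunk. The function name split_with_header and its out_dir and n_header arguments are hypothetical, and as above only the cell values carry over, not the formatting.

```python
import os

import pandas as pd


def split_with_header(master, out_dir, column_name='EAD ID', n_header=4):
    # Rows 1-4 of the template, read as raw cells with no header inference.
    top = pd.read_excel(master, header=None, nrows=n_header)
    # Row 5 becomes the column header; the data starts on row 6.
    df = pd.read_excel(master, skiprows=range(n_header))

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    # groupby yields one sub-frame per distinct value (rows with a blank key are dropped).
    for value, df_output in df.groupby(column_name):
        sheet = str(value)[:31]  # Excel limits sheet names to 31 characters
        path = os.path.join(out_dir, f'{value}.xlsx')
        with pd.ExcelWriter(path, engine='openpyxl') as writer:
            # Template rows go at the top of the sheet, without pandas' own header or index.
            top.to_excel(writer, sheet_name=sheet, header=False, index=False)
            # The table (column names plus rows) follows directly below them.
            df_output.to_excel(writer, sheet_name=sheet, startrow=n_header, index=False)
        paths.append(path)
    return paths
```

Because both to_excel calls go through the same writer and sheet name, the second call writes into the sheet created by the first rather than replacing it.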

Here is the input I used, which is a vague rendition of your screen cap:

And here is the output it generated:

The output looks a little tacky, but if this is just an import file for a piece of software then who cares? If the answer is you, then here is information about how to format it as well: the xlsx writer documentation, or another excellent answer on this site.

Let me know if you have any questions!
