I have an Excel spreadsheet with almost 40,000 rows of entries. I would like to split this Excel file into multiple files based upon the values in Column C starting with Row 6. I've been able to determine how to split the file, but the challenge I seem to be having is getting the header rows to carry over. This is a specific template for the ArchivesSpace Application and for whatever reason the information in Rows 1-5 must be present. I've tried deleting this information and using only the field codes and that was not successful. Here is the code I've tried:

import pandas as pd
import os
import openpyxl

df = pd.read_excel('container_list_master.xlsx')
column_name = 'ead'
unique_values = df[column_name].unique()

for unique_value in unique_values:
    df_output = df[df[column_name].str.contains(unique_value)]
    output_path =os.path.join('output', unique_value + '.xlsx')
    df_output.to_excel(output_path, sheet_name=unique_value, index=False)

Here is an image of the Excel Spreadsheet

  • I'd suggest working directly with openpyxl in such cases. Commented Feb 9, 2024 at 9:34

1 Answer

With some minor alterations, your code can be adapted to produce your desired output.

The main points you were missing:

  • df = pd.read_excel('container_list_master.xlsx', skiprows=range(4)) The skiprows parameter skips the first four non-table rows while reading, so row 5 is used as the column names.
  • df_output.to_excel(output_path, sheet_name=unique_value, startrow=4, index=False) startrow is the writing counterpart of skiprows. It is 0-based, so the column names are written to row 5 and rows 1-4 are left free for the template text.
  • Creating a writer object, then using it to fill in the header cells, as suggested in this answer and demonstrated below. The cell(row, column, value) call used for this belongs to openpyxl, so the writer needs the openpyxl engine.
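
To see how skiprows and startrow line up, here is a small round trip on a throwaway file. The file name, the sheet name "demo", and the sample values are made up for illustration; only the row layout (four preamble rows, column names on row 5, data from row 6) follows your template.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the template: 4 preamble rows, a header row, then data.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
preamble = pd.DataFrame([["template note"], ["section"], ["field"], ["field code"]])
table = pd.DataFrame({"Collection id": ["c1", "c2"], "EAD ID": ["a", "b"]})

with pd.ExcelWriter(path, engine="openpyxl") as writer:
    # Written without header or index, so these occupy sheet rows 1-4.
    preamble.to_excel(writer, sheet_name="demo", header=False, index=False)
    # startrow is 0-based: column names go to sheet row 5, data to row 6 onward.
    table.to_excel(writer, sheet_name="demo", startrow=4, index=False)

# skiprows=range(4) drops sheet rows 1-4, so row 5 becomes the header again.
round_trip = pd.read_excel(path, skiprows=range(4))
print(list(round_trip.columns))
print(round_trip)
```

The full script applies the same two parameters to the real file:
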
import os

import pandas as pd

# Skip the four template rows so that row 5 supplies the column names.
df = pd.read_excel('container_list_master.xlsx', skiprows=range(4))
column_name = 'EAD ID'
# Drop blanks: a NaN value cannot be turned into a file name.
unique_values = df[column_name].dropna().unique()

# The output folder has to exist before the files are written.
os.makedirs('output', exist_ok=True)

for unique_value in unique_values:
    # Exact match; str.contains would also pick up IDs that merely contain this one.
    df_output = df[df[column_name] == unique_value]
    sheet_name = str(unique_value)
    output_path = os.path.join('output', sheet_name + '.xlsx')
    # cell(row, column, value) is openpyxl's API, so ask for that engine explicitly
    # (pandas may otherwise pick XlsxWriter if it is installed).
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        # startrow is 0-based: column names land on row 5, data from row 6.
        df_output.to_excel(writer, sheet_name=sheet_name, startrow=4, index=False)

        ws = writer.sheets[sheet_name]
        ws.cell(1, 1, 'This is the template for importing ...')
        ws.cell(2, 1, 'Mapping - ArchivesSpace ... SECTION')
        ws.cell(3, 1, 'Mapping - ArchivesSpace ... FIELD')
        ws.cell(4, 1, 'ArchivesSpace field code ...')
        ws.cell(2, 2, 'Resource ...')
        ws.cell(3, 2, 'The resource ...')
        ws.cell(4, 2, 'Collection id ...')
        ws.cell(2, 3, 'Resource ...')
        ws.cell(3, 3, 'EAD ID ...')
        ws.cell(4, 3, 'ead ...')
        ws.cell(4, 4, 'res_uri')
        ws.cell(4, 5, 'ref_id')
        ws.cell(4, 6, 'title')
        ws.cell(4, 7, 'something unique')
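
If retyping the header cells gets tedious, one option is to copy the first four rows straight from the master file with openpyxl, in line with the comment under the question. This is only a sketch: read_preamble and split_with_preamble are names I made up, it assumes the template lives on the first sheet, and it copies cell values only (not fonts, colours, or merged cells).

```python
import os

import pandas as pd
from openpyxl import load_workbook


def read_preamble(master_path, n_rows=4):
    """Return the cell values of the first n_rows rows of the first sheet."""
    wb = load_workbook(master_path, read_only=True)
    ws = wb.worksheets[0]
    rows = [list(r) for r in ws.iter_rows(min_row=1, max_row=n_rows, values_only=True)]
    wb.close()
    return rows


def split_with_preamble(master_path, out_dir, column_name, n_rows=4):
    """Write one workbook per value of column_name, each with the copied preamble."""
    preamble = read_preamble(master_path, n_rows)
    df = pd.read_excel(master_path, skiprows=range(n_rows))
    os.makedirs(out_dir, exist_ok=True)
    written = []
    # groupby leaves out rows whose key is blank (NaN) by default.
    for value, group in df.groupby(column_name):
        sheet_name = str(value)[:31]  # Excel caps sheet names at 31 characters
        out_path = os.path.join(out_dir, f"{value}.xlsx")
        with pd.ExcelWriter(out_path, engine="openpyxl") as writer:
            group.to_excel(writer, sheet_name=sheet_name, startrow=n_rows, index=False)
            ws = writer.sheets[sheet_name]
            # Copy every non-empty preamble cell into the same position.
            for r, row in enumerate(preamble, start=1):
                for c, cell_value in enumerate(row, start=1):
                    if cell_value is not None:
                        ws.cell(row=r, column=c, value=cell_value)
        written.append(out_path)
    return written
```

With your file this would be called as split_with_preamble('container_list_master.xlsx', 'output', 'EAD ID'). Values that contain characters not allowed in file or sheet names would still need cleaning first.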

Here is the input I used, which is a vague rendition of your screen cap: demo input

And here is the output it generated: demo output file

The output looks a little tacky, but if this is just an import file for a piece of software, then who cares? If the answer is you, then here is a link with information about how to format it as well: the XlsxWriter documentation, or another excellent answer on this site.
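
For a rough idea of what that formatting could look like, here is a sketch that stays with openpyxl (since the cell() calls above are openpyxl's) rather than XlsxWriter. The data, file name, sheet name, and the width rule are all placeholders of mine, not anything the template requires.

```python
import os
import tempfile

import pandas as pd
from openpyxl.styles import Font
from openpyxl.utils import get_column_letter

df = pd.DataFrame({"Collection id": ["c1"], "EAD ID": ["a"]})  # stand-in data
path = os.path.join(tempfile.mkdtemp(), "formatted.xlsx")

with pd.ExcelWriter(path, engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="demo", startrow=4, index=False)
    ws = writer.sheets["demo"]

    # Put the template note in A1 and make it bold.
    note = ws.cell(row=1, column=1, value="This is the template for importing ...")
    note.font = Font(bold=True)

    # Widen each column to fit its label, with an arbitrary minimum of 12.
    for idx, name in enumerate(df.columns, start=1):
        ws.column_dimensions[get_column_letter(idx)].width = max(len(str(name)) + 2, 12)
```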

Let me know if you have any questions!
