I have Excel spreadsheets that I would like to concatenate into a pandas DataFrame, but the table ranges in the spreadsheets are irregular. The data might begin at, say, C5, D8, or G4 in each spreadsheet. The example below shows a table that starts at B5.

I do not know in advance where the table begins in each spreadsheet, and I cannot specify which sheet to use in each workbook, as there are a few hundred. I intend to compile all sheets into a DataFrame, then extract the rows of data I need. The data is mostly in the same format, but I also need to account for any notes within the spreadsheets.

It would be simpler if the data in every spreadsheet were aligned, so that I could extract the rows I need by index label. Is there a way to shift the data in each spreadsheet so that it begins in the first column?

Here is what I have so far:

import glob

import pandas as pd

path = r'dir'
allFiles = glob.glob(path + "/*.xlsx")

# Read the first sheet of every workbook and collect the frames.
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    list_.append(df)

# Stack all frames into a single DataFrame.
frame = pd.concat(list_)

print(frame)

3 Answers

Here's a solution with openpyxl.

There is no need to save new files or pre-load the data into memory.

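import itertools

from openpyxl import load_workbook
from pandas import DataFrame

def get_data(ws):
    # ws.values yields one tuple of cell values per worksheet row.
    for row in ws.values:
        row_it = iter(row)
        # Skip leading empty cells; rows that are entirely empty yield nothing.
        for cell in row_it:
            if cell is not None:
                # Emit the first non-empty cell followed by the rest of the row.
                yield itertools.chain((cell,), row_it)
                break

def read_workbook(filename):
    wb = load_workbook(filename)
    ws = wb.active  # only the active sheet is read
    return DataFrame(get_data(ws))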

You can easily modify the code to limit the maximum number of cells inspected before a row is considered empty.
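One way that limit could look, as a minimal sketch: the variant below accepts any iterable of row tuples (such as `ws.values`) and gives up on a row once its first `max_skip` cells are all empty. The name `get_data_limited` and the `max_skip` parameter are illustrative assumptions, not part of openpyxl.

```python
import itertools


def get_data_limited(rows, max_skip=10):
    """Yield each row with its leading empty cells removed.

    rows: iterable of tuples, e.g. ``ws.values`` from openpyxl.
    max_skip: hypothetical cutoff; a row whose first ``max_skip`` cells
        are all None is treated as empty and skipped.
    """
    for row in rows:
        row_it = iter(row)
        # Inspect at most max_skip cells before giving up on this row.
        for cell in itertools.islice(row_it, max_skip):
            if cell is not None:
                # Keep the first non-empty cell and the remainder of the row.
                yield [cell, *row_it]
                break


sample = [
    (None, None, None, None),
    (None, "name", "value", None),
    (None, "a", 1, None),
    (None, None, None, "far-right note"),
]
print(list(get_data_limited(sample, max_skip=3)))
```

Note that, like the original, this shifts each row independently, so a data row whose first cell happens to be empty would end up misaligned with the rows around it.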


You could try converting the tables to CSV and stripping the leading commas.

with open("your_file_as_csv", "r") as file_in, open("output_as_csv", "w") as file_out:
    for line in file_in:
        stripped = line.lstrip(",")  # drop leading empty cells
        if stripped.strip(", \r\n"):  # skip rows that are entirely empty
            file_out.write(stripped)

That would at least remove blank lines and align everything to the first row and first column.

But note that in your example you will have trouble with row 2, which contains "summary, 2017".

Are you sure all your tables have the same format (column labels, order, and number)?
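One possible workaround for note rows such as "summary, 2017", sketched under the assumption that notes have fewer filled cells than the table's header row: drop the leading empty cells, then discard everything before the first row with at least `min_cells` non-empty values. The function name `align_table` and the `min_cells` threshold are illustrative and would need tuning to the real files.

```python
import csv
import io


def align_table(lines, min_cells=3):
    """Return table rows shifted to the first column, skipping notes above it.

    lines: iterable of CSV text lines (an open file works).
    min_cells: hypothetical threshold; the first row with at least this
        many non-empty cells is taken to be the header.
    """
    aligned = []
    for row in csv.reader(lines):
        # Drop leading empty cells so the row starts in the first column.
        while row and not row[0].strip():
            row = row[1:]
        filled = sum(1 for cell in row if cell.strip())
        if not aligned and filled < min_cells:
            continue  # blank row or note above the table
        if filled:
            aligned.append(row)
    return aligned


raw = io.StringIO(
    ",,,\n"
    ",summary,2017,\n"
    ",,,\n"
    ",id,name,amount\n"
    ",1,apple,3.5\n"
)
for row in align_table(raw):
    print(row)
```

This only handles notes that appear above the table. Notes below or beside it would still need separate filtering.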


You can use this function:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

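df = df.dropna(axis=0, how='all')  # drop rows that are entirely empty
df = df.dropna(axis=1, how='all')  # drop columns that are entirely empty

# ExcelWriter.save() was removed in pandas 2.0; the context manager saves on exit.
with pd.ExcelWriter('out.xlsx') as writer:
    df.to_excel(writer, sheet_name='out')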
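For the concatenation step in the question, the trimmed frame still needs its column labels. A minimal sketch, assuming each sheet was read with `header=None` and contains nothing but the table (no notes above it): after dropping the empty rows and columns, the first remaining row is promoted to the header. The function name `trim_and_promote_header` is illustrative.

```python
import pandas as pd


def trim_and_promote_header(raw):
    """Drop all-empty rows/columns, then use the first remaining row as header.

    raw: DataFrame read without a header, e.g. pd.read_excel(path, header=None).
    """
    df = raw.dropna(axis=0, how="all").dropna(axis=1, how="all")
    header = df.iloc[0]
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header.tolist()
    return body


# Stand-in for a sheet whose table starts at B2, surrounded by empty cells.
raw = pd.DataFrame(
    [
        [None, None, None, None],
        [None, "id", "name", None],
        [None, 1, "apple", None],
        [None, 2, "pear", None],
    ]
)
print(trim_and_promote_header(raw))
```

Frames produced this way share column labels when the tables share a format, so `pd.concat` can line them up by name rather than by position.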
