Aligning data in Python

Question

I have Excel spreadsheets that I would like to concatenate into a pandas DataFrame, but the table ranges in the spreadsheets are irregular. The data might begin at, say, C5, D8 or G4 in each spreadsheet. The example below shows a table that starts at B5.

I do not know in advance where the table begins in each spreadsheet, and I cannot specify which sheet to read in each workbook, as there are a few hundred of them. I intend to compile all sheets into a DataFrame and then extract the rows of data I need. The data is mostly in the same format, but I also need to bear in mind any notes within the spreadsheets.

It would be simpler if the data in each spreadsheet were aligned, so that I could extract the rows I need by index label. Is there a way to align the data in each spreadsheet so that it begins in the first column?

Here is what I have so far:

import os
import pandas as pd
import glob
import numpy as np

path = r'dir'
allFiles = glob.glob(path + "/*.xlsx")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)

print(list_)

Grisha S · Accepted Answer · 2017-08-16 22:08:59Z

2

Here's a solution with openpyxl.

There is no need to save new files or pre-load data into memory.

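import itertools

from openpyxl import load_workbook
from pandas import DataFrame

def get_data(ws):
    # ws.values yields each worksheet row as a tuple of cell values
    for row in ws.values:
        row_it = iter(row)
        # skip leading empty cells; rows that are entirely empty yield nothing
        for cell in row_it:
            if cell is not None:
                # first non-empty cell followed by the rest of the row
                yield itertools.chain((cell,), row_it)
                break

def read_workbook(filename):
    wb = load_workbook(filename)
    ws = wb.active  # only the active sheet is read
    return DataFrame(get_data(ws))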

You can easily modify the code to limit the maximum number of cells it scans before considering a row empty.
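A minimal sketch of how such a limit might look. The name `get_data_limited` and the `max_skip` parameter are assumptions made for illustration and are not part of the answer above; the function accepts any iterable of row tuples, such as `ws.values`, and yields lists rather than iterators.

```python
import itertools


def get_data_limited(rows, max_skip=10):
    """Yield each row starting at its first non-empty cell.

    `rows` is any iterable of row tuples (for example `ws.values`).
    `max_skip` is a hypothetical cap on how many leading cells are
    scanned before the row is treated as empty.
    """
    for row in rows:
        row_it = iter(row)
        # Only look at the first `max_skip` cells for the start of the data.
        for cell in itertools.islice(row_it, max_skip):
            if cell is not None:
                # Emit the first non-empty cell plus the rest of the row.
                yield [cell, *row_it]
                break


# Example with plain tuples standing in for worksheet rows.
sample_rows = [
    (None, None, "a", "b"),
    (None, None, None, None),
    (None, None, 1, 2),
    (None, None, None, None, None, "far"),
]
limited = list(get_data_limited(sample_rows, max_skip=3))
```

With `max_skip=3` the last sample row is treated as empty, because its first value sits beyond the third cell; a larger limit would keep it.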

FabienP · Accepted Answer · 2017-08-16 20:50:04Z

0

You could try converting the tables to CSV and stripping the leading commas.

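with open("your_file_as_csv", 'r') as file_in, open("output_as_csv", 'w') as file_out:
    for line in file_in:
        # drop leading commas (empty cells left of the table); skip all-empty rows
        line = line.lstrip(',')
        if line.strip():
            file_out.write(line)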

That would at least remove blank lines and align everything to the first row and first column.

But note that in your example you will have trouble with row 2, which contains "summary, 2017".

Are you sure all your tables have the same format (column labels, order, number)?
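As a variation on the same idea, the rows could be parsed with the standard `csv` module, so that a quoted field containing a comma is kept as one field. This is only a sketch: the helper name `left_align_csv` is hypothetical, and a notes row above the table would still be shifted left like any other row.

```python
import csv
import io


def left_align_csv(file_in, file_out):
    """Copy CSV rows, dropping leading empty fields and fully empty rows."""
    writer = csv.writer(file_out, lineterminator="\n")
    for row in csv.reader(file_in):
        # Find the first field that holds any non-whitespace text.
        start = next((i for i, field in enumerate(row) if field.strip()), None)
        if start is not None:
            writer.writerow(row[start:])


# Usage with in-memory text; real files opened with newline="" work the same way.
source = io.StringIO(',,,\n,,name,value\n,,"summary, 2017",1\n')
target = io.StringIO()
left_align_csv(source, target)
print(target.getvalue())
```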

lfpicoloto · Accepted Answer · 2017-08-16 21:14:28Z

0

You can use this function:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

df = df.dropna(axis=0, how='all')
df = df.dropna(axis=1, how='all')

# the context manager saves the file on exit; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('out.xlsx') as writer:
    df.to_excel(writer, sheet_name='out')

Before / After: [screenshots of the sheet before and after dropping the empty rows and columns]
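The effect can be reproduced on a small in-memory frame. The sketch below assumes the sheet was read with `header=None`, so the real header row is still part of the data; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sheet as read with header=None: the table starts at row 2, column 1.
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, "name", "value"],
    [np.nan, "a", 1],
    [np.nan, "b", 2],
])

# Drop rows and columns that are entirely empty.
trimmed = raw.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Promote the first remaining row to column labels and renumber the rows.
aligned = trimmed.iloc[1:].reset_index(drop=True)
aligned.columns = trimmed.iloc[0].tolist()

print(aligned)
```

Note that `how='all'` also drops blank rows inside the table, while a row that holds only a note is kept, so notes would still need to be filtered out separately.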
