pandas read_excel multiple tables on the same sheet

Question

Is it possible to read multiple tables from a single sheet of an Excel file using pandas? Something like: read table1 from row 0 until row 100, read table2 from row 102 until row 202, ...

Why not just read it all in and then separate it into different DataFrames in Python?
– splinter, Commented Apr 12, 2017 at 11:12

Rotem · Accepted Answer · 2019-09-13 21:37:05Z

25

I wrote the following code to identify the multiple tables automatically, in case you have many files you need to process and don't want to look in each one to get the right row numbers. The code also looks for non-empty rows above each table and reads those as table metadata.

import numpy as np
import pandas as pd

def parse_excel_sheet(file, sheet_name=0, threshold=5):
    '''Parse multiple tables from an Excel sheet into multiple DataFrame objects.

    Returns (dfs, df_mds), where dfs is a list of data frames and df_mds is a
    list of their potential associated metadata.'''
    xl = pd.ExcelFile(file)
    entire_sheet = xl.parse(sheet_name=sheet_name)

    # count the number of non-Nan cells in each row and then the change in that number between adjacent rows
    n_values = np.logical_not(entire_sheet.isnull()).sum(axis=1)
    n_values_deltas = n_values[1:] - n_values[:-1].values

    # define the beginnings and ends of tables using delta in n_values
    table_beginnings = n_values_deltas > threshold
    table_beginnings = table_beginnings[table_beginnings].index
    table_endings = n_values_deltas < -threshold
    table_endings = table_endings[table_endings].index
    if len(table_beginnings) < len(table_endings) or len(table_beginnings) > len(table_endings)+1:
        raise ValueError('Could not detect equal number of beginnings and ends')

    # look for metadata before the beginnings of tables
    md_beginnings = []
    for start in table_beginnings:
        md_start = n_values.iloc[:start][n_values==0].index[-1] + 1
        md_beginnings.append(md_start)

    # make data frames
    dfs = []
    df_mds = []
    for ind in range(len(table_beginnings)):
        start = table_beginnings[ind]+1
        if ind < len(table_endings):
            stop = table_endings[ind]
        else:
            stop = entire_sheet.shape[0]
        df = xl.parse(sheet_name=sheet_name, skiprows=start, nrows=stop-start)
        dfs.append(df)

        md = xl.parse(sheet_name=sheet_name, skiprows=md_beginnings[ind], nrows=start-md_beginnings[ind]-1).dropna(axis=1)
        df_mds.append(md)
    return dfs, df_mds

answered Sep 13, 2019 at 21:37

Rotem


7 Comments

bl79 Over a year ago

What is the df_mds list for? All the data frames are placed in dfs.

ArKan Over a year ago

Nice work @Rotem

Shihab Ullah Over a year ago

Throws the following exception for me : ValueError: 'nrows' must be an integer >=0

NoobVB Over a year ago

What's the threshold in this function?

ZdWhite Over a year ago

@NoobVB The threshold variable is a guess at how large a jump marks a table beginning or ending. It's a finicky heuristic in this sense. The variables n_values and n_values_deltas give some hints: n_values is the count of all the elements in a row that ARE NOT NULL, and n_values_deltas is the difference between the current row and the previous row. It is in essence a form of "edge detection": imagine a step function with the row index on the x axis and the deltas on the y axis, where the threshold is a pair of constant horizontal lines at +threshold and -threshold. When a delta crosses one of them, a table boundary is detected.
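To make the "edge detection" idea concrete, here is a small sketch on a toy sheet. The data and variable names are made up for illustration, and `Series.diff` is used as a stand-in for the subtraction in the answer's code; it is not the answer's exact implementation.

```python
import numpy as np
import pandas as pd

# Toy sheet (hypothetical data): a 6-column table in rows 2-4,
# surrounded by completely empty rows.
sheet = pd.DataFrame(np.nan, index=range(7), columns=list("ABCDEF"))
sheet.iloc[2:5] = 1.0

# Number of non-null cells per row, and the change from the previous row.
n_values = sheet.notna().sum(axis=1)
deltas = n_values.diff().fillna(0).astype(int)

threshold = 5
# A jump above +threshold marks the first row of a table;
# a drop below -threshold marks the first row after it.
starts = deltas[deltas > threshold].index.tolist()
ends = deltas[deltas < -threshold].index.tolist()

print(n_values.tolist())  # [0, 0, 6, 6, 6, 0, 0]
print(starts, ends)       # [2] [5]
```

With threshold=5, a table narrower than six columns would not be detected here, which is why the value has to be tuned to the sheets being processed.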


123 · Accepted Answer · 2018-06-19 23:14:18Z

18

Assuming we have the following Excel file:

Solution: we are parsing the first sheet (index: 0)

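import pandas as pd

xl = pd.ExcelFile(fn)  # fn is the path to the Excel file
# total rows in the first sheet (xlrd engine; with openpyxl use xl.book.worksheets[0].max_row)
nrows = xl.book.sheet_by_index(0).nrows

# first table: header row plus 10 data rows, so skip everything below it
df1 = xl.parse(0, skipfooter=nrows-(10+1)).dropna(axis=1, how='all')
# second table: its header comes right after the first 12 rows
df2 = xl.parse(0, skiprows=12).dropna(axis=1, how='all')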

EDIT: skip_footer was replaced with skipfooter

Result:

In [123]: df1
Out[123]:
    a   b   c
0  78  68  33
1  62  26  30
2  99  35  13
3  73  97   4
4  85   7  53
5  80  20  95
6  40  52  96
7  36  23  76
8  96  73  37
9  39  35  24

In [124]: df2
Out[124]:
   c1  c2  c3 c4
0  78  88  59  a
1  82   4  64  a
2  35   9  78  b
3   0  11  23  b
4  61  53  29  b
5  51  36  72  c
6  59  36  45  c
7   7  64   8  c
8   1  83  46  d
9  30  47  84  d

edited Jun 19, 2018 at 23:14

123


answered Apr 12, 2017 at 11:56

MaxU - stand with Ukraine


3 Comments

Vishal Kumar Sahu Over a year ago

Can it be done dynamically?

MaxU - stand with Ukraine Over a year ago

@VishalKumarSahu, yes, it can. Search for the row that separates the data, and use its index...

Vishal Kumar Sahu Over a year ago

I looped over the rows, looked for the desired values, and used their indexes to separate the tables.
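The dynamic approach discussed in these comments could look roughly like the sketch below. It assumes the tables are separated by at least one completely empty row and that the whole sheet has already been read without a header, for example with `pd.read_excel(path, header=None)`. The function name `split_on_blank_rows` is hypothetical.

```python
import pandas as pd

def split_on_blank_rows(raw):
    """Split a header-less DataFrame into tables separated by all-NaN rows."""
    is_blank = raw.isna().all(axis=1)
    # The running count of blank rows is constant within one table,
    # so it can serve as a group id for the non-blank rows.
    block_id = is_blank.cumsum()[~is_blank]
    tables = []
    for _, block in raw[~is_blank].groupby(block_id):
        # Drop columns that are empty for this particular table.
        block = block.dropna(axis=1, how="all")
        # The first row of each block is treated as its header.
        header = block.iloc[0]
        body = block.iloc[1:].reset_index(drop=True)
        body.columns = header.tolist()
        tables.append(body.infer_objects())
    return tables

# Toy data standing in for a sheet read with header=None.
raw = pd.DataFrame([
    ["a", "b", None],
    [1, 2, None],
    [3, 4, None],
    [None, None, None],
    ["c1", "c2", "c3"],
    [5, 6, "x"],
])
tables = split_on_blank_rows(raw)
```

Here `tables[0]` has columns a and b, and `tables[1]` has columns c1, c2 and c3. A table that itself contains a fully empty row would be split in two, so this only suits sheets where blank rows act purely as separators.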

splinter · Accepted Answer · 2017-04-12 11:33:07Z

5

First read in the entire csv file:

import pandas as pd
df = pd.read_csv('path_to\\your_data.csv')

and then obtain the individual frames, for example using:

df1 = df.iloc[:100,:]
df2 = df.iloc[100:200,:]

answered Apr 12, 2017 at 11:33

splinter


1 Comment

MaxU - stand with Ukraine Over a year ago

If it were a CSV file, we could simply use the skiprows and nrows parameters. Unfortunately, nrows is not implemented for pd.read_excel.
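More recent pandas releases do accept nrows in pd.read_excel, so where that is available the skiprows/nrows combination mentioned here can be sketched as follows. The helper name `read_block` and its parameters are hypothetical, and the row positions still have to be known in advance.

```python
import pandas as pd

def read_block(path, header_row, n_data_rows, sheet_name=0):
    """Read one table whose header is on 0-based sheet row `header_row`.

    Assumes a pandas version in which read_excel supports nrows.
    """
    df = pd.read_excel(
        path,
        sheet_name=sheet_name,
        skiprows=header_row,  # skip every row above the table's header
        nrows=n_data_rows,    # number of data rows to read below the header
    )
    # Remove columns that are entirely empty for this table.
    return df.dropna(axis=1, how="all")
```

For the layout in the question this would be something like `read_block(fn, 0, 100)` for the first table and `read_block(fn, 102, 100)` for the second, assuming each of those rows holds a header.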
