
I know how to read an Excel table with pandas:

import pandas as pd

table = pd.read_excel(io)

After loading the data, I can get the table header with:

table.columns

This works, but sometimes I just want to get the header of the Excel table directly, especially when the table body is large: loading the whole table into memory is very time-consuming and unnecessary, and sometimes it even runs out of memory and hangs. Looking at the official documentation, it seems I can use the nrows parameter to read only a given number of rows, which suggests I could use it to read just the header row:

header = pd.read_excel(io, nrows=0)

However, I found that this still does not stop pandas from reading the whole Excel file, so it still consumes a lot of time and memory. Does anyone have a good way of dealing with this problem?

4 Comments
  • Does this answer your question? Reading column names alone in a csv file Commented Mar 27, 2020 at 4:02
  • No, an xlsx file is different Commented Mar 27, 2020 at 6:14
  • Then only the file extension changes; try that code after changing the file extension. Commented Mar 27, 2020 at 6:17
  • Have a look at this library and see if it helps: pyexcel Commented Mar 27, 2020 at 7:42

3 Answers

2

This function, sheet_rows, uses openpyxl directly rather than pandas; it's much faster than read_excel(nrows=0), and simple:

#!/usr/bin/env python3

import sys

import openpyxl  # https://openpyxl.readthedocs.io
import pandas as pd

#...............................................................................
def sheet_rows( sheet, nrows=3, ncols=None, verbose=5 ) -> "list of lists":
    """ openpyxl sheet -> the first `nrows` rows x `ncols` columns
        verbose=5: print A1 .. A5, E1 .. E5 as lists
    """
    rows = sheet.iter_rows( max_row=nrows, max_col=ncols, values_only=True )
    rows = [list(r) for r in rows]  # generator -> list of lists
    if verbose:
        print( "\n-- %s  %d rows  %d cols" % (
                sheet.title, sheet.max_row, sheet.max_column ))
        for row in rows[:verbose]:
            trimNone = list( filter( None, row[:verbose] ))
            print( trimNone )
    return rows


xlsxin = sys.argv[1]  # path to the .xlsx file, from the command line
nrows = 3             # how many rows to read from each sheet
wb = openpyxl.load_workbook( xlsxin, read_only=True )
print( "\n-- openpyxl.load_workbook( \"%s\" )" % xlsxin )

for sheetname in wb.sheetnames:
    sheet = wb[sheetname]

    rows = sheet_rows( sheet, nrows=nrows )

    df = (pd.DataFrame( rows )  # index= columns=
            .dropna( axis="index", how="all" )
            .dropna( axis="columns", how="all" ) 
            )
    print( df )
    # df.to_excel df.to_csv ...

"Partial read" under pyexcel explains that most Excel readers read ALL the data into memory before doing anything else -- slow. openpyxl iter_rows() gets a few rows or columns fast, memory don't know.


2
import pandas as pd 
Frame = pd.read_excel("/content/data.xlsx", header=0)
print(Frame.head(0))

This prints only the header, assuming the header is in row 1 (header=0). If head() is called with no argument, the default of 5 rows is used, which is why you otherwise get multiple lines.
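
Note that read_excel here still loads the entire file into the DataFrame; head(0) only limits what gets printed. A quick sketch, reusing the same path as above:

import pandas as pd

Frame = pd.read_excel("/content/data.xlsx", header=0)  # still reads the whole file
print(Frame.head(0))          # empty DataFrame, but the column names are shown
print(list(Frame.columns))    # the header as a plain Python list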


0
import pandas as pd 

Frame = pd.read_excel("/content/data.xlsx", header=0)
Frame.head()

1 Comment

Thanks, but I just want to get the header; this method still takes a lot of time and memory because it reads the whole data.
