
I know how to read an Excel table with pandas:

import pandas as pd

table = pd.read_excel(io)

After loading the data, I can get the table header with:

table.columns

This works, but sometimes I just want to get the header of the Excel table directly, especially when the table body is large: loading the whole table into memory is very time-consuming and unnecessary, and sometimes it even runs out of memory and hangs. Looking at the official documentation, it seems I can use the nrows parameter to read only a given number of rows, which suggests I could use it to read just the header row:

header = pd.read_excel(io, nrows=0)

However, I found that this still does not stop pandas from reading the whole Excel file, so it still consumes a lot of time and memory. Does anyone have a good way of dealing with this problem?

4 Comments
  • Does this answer your question? Reading column names alone in a csv file Commented Mar 27, 2020 at 4:02
  • No, an xlsx file is different Commented Mar 27, 2020 at 6:14
  • Then only the file extension changes; try that code after changing the file extension. Commented Mar 27, 2020 at 6:17
  • Have a look at this library and see if it helps: pyexcel Commented Mar 27, 2020 at 7:42

3 Answers

2

This function, sheet_rows, uses openpyxl directly rather than pandas; it's much faster than read_excel(nrows=0), and simple:

#!/usr/bin/env python3

import sys

import openpyxl  # https://openpyxl.readthedocs.io
import pandas as pd

#...............................................................................
def sheet_rows( sheet, nrows=3, ncols=None, verbose=5 ) -> "list of lists":
    """ openpyxl sheet -> the first `nrows` rows x `ncols` columns
        verbose=5: print A1 .. A5, E1 .. E5 as lists
    """
    rows = sheet.iter_rows( max_row=nrows, max_col=ncols, values_only=True )
    rows = [list(r) for r in rows]  # generator -> list of lists
    if verbose:
        print( "\n-- %s  %d rows  %d cols" % (
                sheet.title, sheet.max_row, sheet.max_column ))
        for row in rows[:verbose]:
            trimNone = list( filter( None, row[:verbose] ))
            print( trimNone )
    return rows


xlsxin = sys.argv[1]  # path to the .xlsx file, from the command line
nrows = 3             # how many rows to read from each sheet
wb = openpyxl.load_workbook( xlsxin, read_only=True )
print( "\n-- openpyxl.load_workbook( \"%s\" )" % xlsxin )

for sheetname in wb.sheetnames:
    sheet = wb[sheetname]

    rows = sheet_rows( sheet, nrows=nrows )

    df = (pd.DataFrame( rows )  # index= columns=
            .dropna( axis="index", how="all" )
            .dropna( axis="columns", how="all" ) 
            )
    print( df )
    # df.to_excel df.to_csv ...

"Partial read" under pyexcel explains that most Excel readers read ALL the data into memory before doing anything else -- slow. openpyxl iter_rows() gets a few rows or columns fast, memory don't know.


2
import pandas as pd 
Frame = pd.read_excel("/content/data.xlsx", header=0)
print(Frame.head(0))

This prints only the header, assuming the header is in row 1 (header=0). If head() is called with no argument, the default of 5 rows is used, which is why you otherwise get multiple lines.
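
Note that read_excel here still loads the entire file into the DataFrame; head(0) only limits what gets printed. A quick sketch, reusing the same path as above:

import pandas as pd

Frame = pd.read_excel("/content/data.xlsx", header=0)  # still reads the whole file
print(Frame.head(0))          # empty DataFrame, but the column names are shown
print(list(Frame.columns))    # the header as a plain Python list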


0
import pandas as pd 

Frame = pd.read_excel("/content/data.xlsx", header=0)
Frame.head()

1 Comment

Thanks, but I just want to get the header; this method still takes a lot of time and memory because it reads the whole data.
