User Input pd.read_excel gives "ValueError: Invalid file path or buffer object type" - Pandas

Question

I have a list of Excel files that are read into pandas DataFrames. However, in some files the header sits in a different row than in others. Therefore, I would like to take a user input that tells me which row to use as the header for each DataFrame.

Let's say my first DataFrame (Excel file) looks like this:

0   245                 867               
1   Reddit              Facebook          
2   ColumnNeeded        ColumnNeeded      
3   RedditInsight       FacbookInsights   
4   RedditText          FacbookText

Now, I want the user to look at this and enter row 2 (index 1) as the header row, so that my output DataFrame looks like this:

    Reddit              Facebook          
0   ColumnNeeded        ColumnNeeded      
1   RedditInsight       FacbookInsights   
2   RedditText          FacbookText

This way, I can create headers for each dataframe.

This is what I have so far:

excel_file_dfs = []

for file in glob.glob(r'path\*.xlsx'):
    df = pd.read_excel(file)

## Not sure how to show the DataFrame here so, user can select the row to be the header

    ask_user = input("What row do you want to make the header? ")
    header_number = ask_user
    df = pd.read_excel(file, header=[header_number])
    excel_file_dfs.append(df)

I am getting this error:

ValueError: Invalid file path or buffer object type:

from line df = pd.read_excel(each_file, header=[ask_user]).

I know I am calling pd.read_excel() twice per file, which might cost a lot of memory and processing time.

Anyhow, I want the user to see each DataFrame and then input the row number to select the header. How can I do it in pandas?

Anton vBR · Accepted Answer · 2018-07-10 21:00:22Z

1

How many rows down can the header be? Let us assume it is within the first 5. Would this approach make sense?

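from io import StringIO

import pandas as pd

data = '''\
245                 867               
Reddit              Facebook          
ColumnNeeded        ColumnNeeded      
RedditInsight       FacbookInsights   
RedditText          FacbookText
'''

# pd.compat.StringIO is gone from current pandas releases; io.StringIO does the same job
fileobj = StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+', header=None)

print(df.head(5))

inp = input('Which row is header? ')
n = int(inp)  # input() returns a string, so convert it

df.columns = df.loc[n].values  # promote row n to the column labels
df = df.loc[n+1:]              # keep only the rows below it
print(df)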

Or define a function with a loop:

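def change_header(df, i=5):
    """Page through df i rows at a time until the user picks a header row."""
    n = 0
    while True:
        # iloc excludes the end position, so exactly i rows are shown
        print(df.iloc[n:n+i])
        inp = input('Which row is header? (number or (n)ext or (r)estart) ')
        if inp.isdigit():
            n = int(inp)
            if n < len(df):
                break
            # out of range: report it and start again from the top
            n = 0
            print('error')
        elif inp.lower().startswith('r'):
            n = 0
        elif inp.lower().startswith('n'):
            # only advance if another page exists
            if (n+i) < len(df):
                n += i
        else:
            print('Try something else')

    df.columns = df.iloc[n].values  # promote the chosen row to column labels
    df = df.iloc[n+1:]              # keep only the rows below it
    return df

df = change_header(df, 5)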
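
To tie this back to the Excel files in the question, here is a minimal sketch that reads each workbook only once with header=None and then promotes the chosen row, which avoids the second pd.read_excel() call. The helper names promote_header and load_with_prompt are illustrative assumptions and not part of the answer above.

```python
import glob

import pandas as pd


def promote_header(df, n):
    """Return a copy of df that uses row position n as the column labels."""
    out = df.iloc[n + 1:].copy()        # keep only the rows below the header
    out.columns = df.iloc[n].tolist()   # row n supplies the column labels
    return out.reset_index(drop=True)   # renumber the remaining rows from 0


def load_with_prompt(pattern, preview_rows=5):
    """Hypothetical driver: preview each workbook and ask for its header row."""
    frames = []
    for path in glob.glob(pattern):
        raw = pd.read_excel(path, header=None)      # single read, no header guess
        print(raw.head(preview_rows))
        n = int(input(f"Header row for {path}? "))  # input() returns a str
        frames.append(promote_header(raw, n))
    return frames
```

It could then be called as excel_file_dfs = load_with_prompt(r'path\*.xlsx'), using the same placeholder pattern as the question. Non-numeric input would raise ValueError here, so a real script would likely wrap the int() call in a retry loop like the one in change_header.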

edited Jul 10, 2018 at 21:00

answered Jul 10, 2018 at 20:47

Anton vBR


7 Comments

user9431057 Over a year ago

I was thinking of df.head(20), so up to 20 rows! Let me try this!

Anton vBR Over a year ago

@user9431057 We could build a loop too.

Anton vBR Over a year ago

@user9431057 Added a loop that shows 5 rows at a time in a function.

user9431057 Over a year ago

Thanks. I am new to Python, and some of this I don't understand or find overwhelming. Let me give it a shot!

Anton vBR Over a year ago

@user9431057 Sorry for that. Basically I'm just taking care of the fact that you want to see only 5 rows at a time. There is nothing really special about the code. It is really basic if you take your time and read it line by line. Python is a lot about reading code that others wrote! Welcome!


Satya Prakash Dash · Accepted Answer · 2018-07-10 20:55:58Z

0

You can use the os library to list the files and read them like this:

import os
import pandas as pd

excel_file_dfs = []
directory = 'C:/your_directory_here'
for filename in os.listdir(directory):
    if filename.endswith('.xlsx'):
        # input() (not print()) returns what the user typed, as a string
        header_number = input('Enter row number you want to make header: ')
        # os.listdir() yields bare file names, so join them with the directory
        path = os.path.join(directory, filename)
        df = pd.read_excel(path, header=int(header_number))
        excel_file_dfs.append(df)
final_df = pd.concat(excel_file_dfs)
final_df

This way you are asked for the header row of each file, while os lists the directory and picks up every Excel workbook in it. I hope this answers your question. :)
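
As a small illustration of why the conversion to int matters: input() returns a string, while the header argument expects a zero-based integer row position. The sketch below uses read_csv on an in-memory buffer holding the sample data from the question (written as comma-separated text), so it runs without any Excel file; read_excel documents header as a zero-based row position in the same way. The to_row_number helper is a hypothetical name introduced only for this example.

```python
from io import StringIO

import pandas as pd

SAMPLE = """\
245,867
Reddit,Facebook
ColumnNeeded,ColumnNeeded
RedditInsight,FacbookInsights
RedditText,FacbookText
"""


def to_row_number(text):
    """Convert raw user input to a non-negative int, or raise ValueError."""
    value = int(text.strip())  # int() raises ValueError for non-numeric text
    if value < 0:
        raise ValueError("row number must be non-negative")
    return value


header_row = to_row_number("1")  # stands in for input(...)
# header=1 uses the second line as column labels and drops the line above it
df = pd.read_csv(StringIO(SAMPLE), header=header_row)
print(df)
```

So a user who wants the 5th row as the header would enter 4, not 5.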

edited Jul 10, 2018 at 20:55

answered Jul 10, 2018 at 19:49

Satya Prakash Dash


5 Comments

Satya Prakash Dash Over a year ago

So, in an Excel sheet with 11 columns you want only 5 of them. Is that what you are asking for?

Satya Prakash Dash Over a year ago

By column I mean going down and by row I mean going right. The header usually sits at the top of the first column. If you want to supply the header names yourself, go ahead; if the sheet has no header row, you can pass header=None to read_excel(). Otherwise the prompt asks for the header row and you can enter an integer.

Satya Prakash Dash Over a year ago

Doing this: df = pd.read_csv('E:/algo/data/combining/FINAL_5DAY_DATA.csv',header=2) in one of my files makes the row at zero-based index 2 (the 3rd row) the header.

user9431057 Over a year ago

Agreed, but what if the user wants the 5th row of a file to be the header?

Satya Prakash Dash Over a year ago

Let us continue this discussion in chat.
