I have a list of Excel files that are read into pandas DataFrames. However, the header sits in a different row in some of the files. Therefore, I would like to ask for user input to set the header row for each DataFrame.

Let's say my first DataFrame (from the first Excel file) looks like this:

0   245                 867               
1   Reddit              Facebook          
2   ColumnNeeded        ColumnNeeded      
3   RedditInsight       FacbookInsights   
4   RedditText          FacbookText             

Now I want the user to look at this and enter the number of the header row, here the second row (index 1). My output DataFrame would then look like this:

    Reddit              Facebook          
0   ColumnNeeded        ColumnNeeded      
1   RedditInsight       FacbookInsights   
2   RedditText          FacbookText

This way, I can create headers for each dataframe.

This is what I have so far:

excel_file_dfs = []

for file in glob.glob(r'path\*.xlsx'):
    df = pd.read_excel(file)

    # Not sure how to show the DataFrame here so the user can select the header row

    ask_user = input("What row do you want to make the header? ")
    header_number = ask_user
    df = pd.read_excel(file, header=[header_number])
    excel_file_dfs.append(df)

I am getting this error:

ValueError: Invalid file path or buffer object type:

from line df = pd.read_excel(each_file, header=[ask_user]).

I know I am calling pd.read_excel() twice per file, which probably costs extra memory and processing time.

Anyhow, I want the user to see each DataFrame and then input the row number to select the header. How can I do it in pandas?

2 Answers

How many rows down can the header be? Let us assume it is within the first 5: Would this approach make sense?

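import pandas as pd
from io import StringIO

data = '''\
245                 867               
Reddit              Facebook          
ColumnNeeded        ColumnNeeded      
RedditInsight       FacbookInsights   
RedditText          FacbookText
'''

# io.StringIO replaces pd.compat.StringIO, which recent pandas releases no longer provide
fileobj = StringIO(data)
# raw string so that \s is passed to the regex engine unchanged
df = pd.read_csv(fileobj, sep=r'\s+', header=None)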

print(df.head(5))

inp = input('Which row is header?')
n = int(inp)

df.columns = df.loc[n].values
df = df.loc[n+1:]
print(df)

Or define a function with a loop:

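def change_header(df, i=5):
    """Page through df i rows at a time until the user picks a header row."""
    n = 0
    while True:
        # iloc excludes the end position, so exactly i rows are shown
        print(df.iloc[n:n+i])
        inp = input('Which row is header? (number or (n)ext or (r)estart) ')
        if inp.isdigit():
            n = int(inp)
            if n < len(df):
                break
            # out of range: report it and go back to the first page
            n = 0
            print('error')
        elif inp.lower().startswith('r'):
            n = 0
        elif inp.lower().startswith('n'):
            # only advance if there is another page to show
            if (n+i) < len(df):
                n += i
        else:
            print('Try something else')

    # promote row n to column labels and keep only the rows below it
    df.columns = df.iloc[n].values
    df = df.iloc[n+1:]
    return df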

df = change_header(df, 5)
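
To tie this back to the Excel files in the question, here is a minimal sketch that reads each file only once with header=None, shows a preview, and then promotes the chosen row to the header. The names promote_row_to_header and load_with_chosen_headers, the preview_rows default of 20 and the ask parameter are my own assumptions for illustration, not part of pandas, and the glob pattern is the placeholder from the question.

```python
import glob

import pandas as pd


def promote_row_to_header(raw, n):
    """Use positional row n of raw as column labels and keep the rows below it."""
    # reset_index returns a new DataFrame, so raw itself is left unchanged
    out = raw.iloc[n + 1:].reset_index(drop=True)
    out.columns = raw.iloc[n].tolist()
    return out


def load_with_chosen_headers(pattern, preview_rows=20, ask=input):
    """Read each matching Excel file once and let the user pick its header row."""
    frames = []
    for path in glob.glob(pattern):
        # header=None: no row is treated as a header yet, every row is data
        raw = pd.read_excel(path, header=None)
        print(path)
        print(raw.head(preview_rows))
        # input() returns a string, so convert it before using it as a position
        n = int(ask("Which row is the header? "))
        frames.append(promote_row_to_header(raw, n))
    return frames


# Example call (hypothetical path pattern, taken from the question):
# excel_file_dfs = load_with_chosen_headers(r'path\*.xlsx')
```

Because the rows were first read as plain data, the resulting columns may keep the object dtype, so you might need to convert them afterwards.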

7 Comments

I was thinking of df.head(20), to show up to 20 rows! Let me try this!
@user9431057 We could build a loop too.
@user9431057 Added a loop that shows 5 rows at a time in a function.
Thanks. I am new to Python, and some of this I don't understand or find overwhelming. Let me give it a shot!
@user9431057 Sorry for that. Basically I'm just taking care of the fact that you want to see only 5 rows at a time. There is nothing really special about the code. It is really basic if you take your time and read it row-by-row. Python is a lot about reading code that others made! welcome!

You can use the os library to list the files, like this:

import os
import pandas as pd

excel_file_dfs = []
directory = 'C:/your_directory_here'
for filename in os.listdir(directory):
    if filename.endswith('.xlsx'):
        # input() (not print()) reads the answer; it returns a string
        header_number = input('Enter row number you want to make header: ')
        # os.listdir() returns bare file names, so join them with the directory
        path = os.path.join(directory, filename)
        df = pd.read_excel(path, header=int(header_number))
        excel_file_dfs.append(df)
final_df = pd.concat(excel_file_dfs)
final_df

This way you are asked for the header row of each file while the loop goes through every Excel file in the directory. Hope that answers your question. :)
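
Since input() always returns a string, the conversion to int is the step most likely to fail when the user mistypes. A minimal sketch of a validated prompt follows; parse_header_row and ask_header_row are hypothetical helper names of my own, and the ask parameter exists only so the prompt can be replaced in tests.

```python
def parse_header_row(text, n_rows):
    """Return the row number typed by the user, or None if it is not usable."""
    try:
        n = int(text.strip())
    except ValueError:
        # not a whole number at all, e.g. "abc" or an empty string
        return None
    # accept only positions that exist in a frame with n_rows rows
    return n if 0 <= n < n_rows else None


def ask_header_row(n_rows, ask=input):
    """Keep prompting until the user enters a valid row number."""
    while True:
        n = parse_header_row(ask(f"Header row (0 to {n_rows - 1}): "), n_rows)
        if n is not None:
            return n
        print("Please enter a whole number in that range.")
```

The returned integer can then be passed as header= to pd.read_excel().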

5 Comments

So, say an Excel sheet has 11 columns and you want only 5 of them. Is that what you are asking for?
By column I mean going down and by row I mean going right, and the header usually sits at the top of the first column. If you want to set the header yourself, go ahead; if you don't, you can add header = True in the read_excel() function. The prompt then asks for the required header row and you can enter an integer.
Doing this: df = pd.read_csv('E:/algo/data/combining/FINAL_5DAY_DATA.csv',header=2) in one of my files makes the 1st row header.
Agreed, but what if you have a file where the user wants the 5th row to be the header?
