pandas dataframe, setting index_col to my csv name

Question

I have a question about using pd.read_csv. I am currently building a DataFrame from multiple CSV files in a folder, and the files are named like "C2__1979H" or "C2_1999Z".

I would like to set the index of my DataFrame to the name of the CSV file that each row was read from. I have yet to find a way to do that. Here is my current code

My DataFrame looks like this:

    Date     Open    High     Low   Close     Vol  OI  Roll
0   19780106  236.00  237.50  234.50  235.50    0   0     0
1   19780113  235.50  239.00  235.00  238.25    0   0     0
2   19780120  238.00  239.00  234.50  237.00    0   0     0
3   19780127  237.00  238.50  235.50  236.00    0   0     0

I want it to look like this:

                Date    Open    High     Low   Close  Vol  OI  Roll
C2__1979N   19780106  236.00  237.50  234.50  235.50    0   0     0
C2__1979N   19780113  235.50  239.00  235.00  238.25    0   0     0
C2__1979N   19780120  238.00  239.00  234.50  237.00    0   0     0
C2__1979Z   19780127  237.00  238.50  235.50  236.00    0   0     0 ##(assuming this is where the next csv file began)

Please note: I know my index_col is None, but I wouldn't know what to set it to anyway. Thanks.
– antonio_zeus, Commented Sep 10, 2015 at 20:37
I just answered your question; tell me if it fulfills your needs.
– Romain, Commented Sep 10, 2015 at 21:01
Is there a reason you need to do this? You're abusing the point of the index here. You can either build a dict of dfs, where the key is the csv name, or add a field called 'csv_name'. By doing what you desire you completely ruin the usefulness of the index.
– EdChum, Commented Sep 11, 2015 at 8:11
I am always open to other solutions. Feel free to post an answer below. Although Romain has already answered, I am still open to other ways of doing things. That's how you learn! Thanks.
– antonio_zeus, Commented Sep 11, 2015 at 14:58
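
The two alternatives EdChum mentions (a dict of DataFrames keyed by CSV name, or an extra 'csv_name' column) could look roughly like the sketch below. It is only an illustration: the in-memory strings stand in for real files, and the way the sample rows are split between the two names is an assumption.

```python
import io

import pandas as pd

names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI', 'Roll']

# Hypothetical stand-ins for the CSV files (rows taken from the question;
# which rows belong to which file is assumed for the demo).
raw = {
    'C2__1979N': "19780106,236.00,237.50,234.50,235.50,0,0,0\n"
                 "19780113,235.50,239.00,235.00,238.25,0,0,0\n",
    'C2__1979Z': "19780120,238.00,239.00,234.50,237.00,0,0,0\n"
                 "19780127,237.00,238.50,235.50,236.00,0,0,0\n",
}

# Option 1: one DataFrame per file, looked up by CSV name.
frames = {name: pd.read_csv(io.StringIO(text), names=names)
          for name, text in raw.items()}

# Option 2: a single DataFrame with the CSV name as an ordinary column,
# leaving the default integer index untouched.
combined = pd.concat(
    [frame.assign(csv_name=name) for name, frame in frames.items()],
    ignore_index=True,
)

print(frames['C2__1979Z'])
print(combined)
```

With the column variant, rows from one file can still be selected with `combined[combined['csv_name'] == 'C2__1979N']`, and the index stays unique.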

Romain · Accepted Answer · 2015-09-10 22:06:52Z

2

This does the trick: add the file name as a column, then set that column as the index.

import os
import pandas as pd

df_temp = pd.DataFrame({'Close': [235.5, 238.25, 237.0, 236.0],
 'Date': [19780106, 19780113, 19780120, 19780127],
 'High': [237.5, 239.0, 239.0, 238.5],
 'Low': [234.5, 235.0, 234.5, 235.5],
 'OI': [0, 0, 0, 0],
 'Open': [236.0, 235.5, 238.0, 237.0],
 'Roll': [0, 0, 0, 0],
 'Vol': [0, 0, 0, 0]})

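frames = []

# To simulate several df: each "file" contributes two rows of df_temp
x = 0
for file_ in ['the_path/C2__1979N.csv', 'other_path/C2__1979H.csv']:
    filename, file_extension = os.path.splitext(file_)
    df_temp['name'] = os.path.basename(filename)
    # collect the slices in a list (DataFrame.append no longer exists in pandas 2.0+)
    frames.append(df_temp.loc[x:x+1, :].copy())
    x += 2  # move on to the next two rows

df = pd.concat(frames)
df.set_index('name', inplace=True)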
df.index.name = None
print(df)

# Result
            Close      Date   High    Low  OI   Open  Roll  Vol
C2__1979N  235.50  19780106  237.5  234.5   0  236.0     0    0
C2__1979N  238.25  19780113  239.0  235.0   0  235.5     0    0
C2__1979H  237.00  19780120  239.0  234.5   0  238.0     0    0
C2__1979H  236.00  19780127  238.5  235.5   0  237.0     0    0

Applied to your original code:

names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI', 'Roll']
frames = []
for file_ in allFiles:
    df_temp = pd.read_csv(file_, index_col=None, names=names)
    df_temp['Roll'] = 0
    df_temp.iloc[-2, -1] = 1  # flag the second-to-last row in the 'Roll' column
    filename, file_extension = os.path.splitext(file_)
    df_temp['name'] = os.path.basename(filename)
    frames.append(df_temp)

# build the frame in one concat (DataFrame.append no longer exists in pandas 2.0+)
df = pd.concat(frames, ignore_index=True)
df.set_index('name', inplace=True)
df.index.name = None
df = df[names]

df = df.drop_duplicates('Date') ## remove duplicate rows with same date

edited Sep 10, 2015 at 22:06

answered Sep 10, 2015 at 20:44

Romain


3 Comments

antonio_zeus Over a year ago

What if you don't know the file name? As in, if I ran this code on different folders that contain different numbers of CSVs with different names, how would the code handle that? Right now it looks like I would need to type in the name of each file. Am I correct?

Romain Over a year ago

When you loop through your files (allFiles) you have the filename (file_), right? In that case you simply have to use it. I typed the file names manually just to simulate.

Romain Over a year ago

I have modified my answer to extract only the file name from the path.
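
To illustrate the point about not typing file names by hand, here is a minimal sketch that discovers the CSVs with glob and labels each row with the base name of its file. The helper name `load_folder`, the temporary folder, and the sample file contents are assumptions made for the demo and are not part of the original answer.

```python
import glob
import os
import tempfile

import pandas as pd

names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI', 'Roll']


def load_folder(folder):
    """Read every *.csv in `folder`; index each row by its file's base name."""
    frames = []
    for file_ in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        df_temp = pd.read_csv(file_, index_col=None, names=names)
        # e.g. 'the_path/C2__1979N.csv' -> 'C2__1979N'
        df_temp['name'] = os.path.splitext(os.path.basename(file_))[0]
        frames.append(df_temp)
    # Note: pd.concat raises ValueError if the folder holds no CSV files.
    df = pd.concat(frames, ignore_index=True).set_index('name')
    df.index.name = None
    return df


# Demo: two made-up files in a temporary folder (rows from the question;
# the split between files is assumed).
samples = {
    'C2__1979N.csv': "19780106,236.00,237.50,234.50,235.50,0,0,0\n"
                     "19780113,235.50,239.00,235.00,238.25,0,0,0\n",
    'C2__1979Z.csv': "19780120,238.00,239.00,234.50,237.00,0,0,0\n"
                     "19780127,237.00,238.50,235.50,236.00,0,0,0\n",
}
with tempfile.TemporaryDirectory() as folder:
    for fname, text in samples.items():
        with open(os.path.join(folder, fname), 'w') as fh:
            fh.write(text)
    df = load_folder(folder)

print(df)
```

The same function can be pointed at any folder, whatever the number or names of the CSVs inside it.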

hellpanderr · Accepted Answer · 2015-09-10 20:41:46Z

0

Have you tried the obvious one?

df_temp.index = [file_]*len(df_temp)
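
A small sketch of that one-liner in context, under the assumption that file_ holds a full path: used as-is, the labels would be the whole path including the extension, so the example strips both first. The two-row frame and the path are illustrative only.

```python
import os

import pandas as pd

# Tiny stand-in for one file's data (values from the question).
df_temp = pd.DataFrame({'Date': [19780106, 19780113],
                        'Close': [235.50, 238.25]})

file_ = 'the_path/C2__1979N.csv'  # hypothetical path for the demo

# Keep only the base name without directory or extension.
label = os.path.splitext(os.path.basename(file_))[0]

# Repeat the label once per row and use it as the index.
df_temp.index = [label] * len(df_temp)

print(df_temp)
```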

answered Sep 10, 2015 at 20:41

hellpanderr
