
I am reading several .csv files. Each file is a time series with the date in column one (which I would like to index by) and the values in column two. I can read in the data, but it all gets appended to the same column of the dataframe, when I would like each file to have its own column indexed by date.

So, for example, if I have 3 files (I have more than three in reality):

csv1
1/1/2016,1.1
2/1/2016,1.2
3/1/2016,1.6

csv2
1/1/2016,4.6
2/1/2016,31.2
3/1/2016,1.8

csv3
2/1/2016,3.2
3/1/2016,5.8

Currently I return:

0        1 
1/1/2016 1.1
2/1/2016 1.2
3/1/2016 1.6
1/1/2016 4.6
2/1/2016 31.2
3/1/2016 1.8
2/1/2016 3.2
3/1/2016 5.8

When I would like to return:

0        1   2   3
1/1/2016 1.1 4.6 null
2/1/2016 1.2 31.2 3.2
3/1/2016 1.6 1.8 5.8

My code at the moment looks like this:

def getData(rawDataPath):
    allfiles = glob.glob(os.path.join(rawDataPath, "*.csv"))

    np_array_list = []
    for file_ in allfiles:
        df = pd.read_csv(file_, index_col=None, header=0)
        np_array_list.append(df.to_numpy())  # as_matrix() is deprecated

    # Stacking vertically appends each file's rows below the previous ones,
    # which is why everything ends up in the same two columns.
    comb_np_array = np.vstack(np_array_list)
    big_frame = pd.DataFrame(comb_np_array)

    return big_frame

1 Answer

Since you already use DataFrame from pandas, might as well use pandas' join/merging functionality:

In [21]: csv1 = io.StringIO("""1/1/2016,1.1
2/1/2016,1.2
3/1/2016,1.6""")

In [22]: csv2 = io.StringIO("""1/1/2016,4.6
2/1/2016,31.2
3/1/2016,1.8""")

In [23]: csv3 = io.StringIO("""2/1/2016,3.2
3/1/2016,5.8""")

In [24]: df1 = pd.read_csv(csv1, header=None)

In [25]: df2 = pd.read_csv(csv2, header=None)

In [26]: df3 = pd.read_csv(csv3, header=None)

In [27]: pd.merge(pd.merge(df1, df2, on=0, how='outer'), df3, on=0, how='outer')
Out[27]: 
          0  1_x   1_y    1
0  1/1/2016  1.1   4.6  NaN
1  2/1/2016  1.2  31.2  3.2
2  3/1/2016  1.6   1.8  5.8

The example uses how='outer', which means a full outer join. That was chosen in case your data has missing keys from file to file (as csv3 does here). If that is not the case, pick whichever join strategy suits you best.
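To see what the choice of join strategy buys you, here is a minimal sketch (using made-up two-row frames, not your actual files) contrasting how='outer' with how='inner' on the shared date column:

```python
import io
import pandas as pd

# Two tiny frames that share some, but not all, dates (illustrative sample data).
left = pd.read_csv(io.StringIO("1/1/2016,1.1\n2/1/2016,1.2"), header=None)
right = pd.read_csv(io.StringIO("2/1/2016,3.2\n3/1/2016,5.8"), header=None)

# Outer join keeps the union of dates, filling gaps with NaN.
outer = pd.merge(left, right, on=0, how='outer')

# Inner join keeps only dates present in both frames.
inner = pd.merge(left, right, on=0, how='inner')

print(len(outer))  # 3 rows: union of dates
print(len(inner))  # 1 row: intersection of dates
```

With files like csv3 that are missing some dates, the outer join is what preserves every date with NaN placeholders, matching the desired output above.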

To reduce all your files in a sane fashion you can, for example, do:

In [30]: from functools import partial, reduce

In [31]: reduce(partial(pd.merge, on=0, how='outer'), [df1, df2, df3])
Out[31]: 
          0  1_x   1_y    1
0  1/1/2016  1.1   4.6  NaN
1  2/1/2016  1.2  31.2  3.2
2  3/1/2016  1.6   1.8  5.8

Just replace the list with your own preloaded dataframes:

def getData(rawDataPath):
    path = rawDataPath
    allfiles = glob.glob(os.path.join(path, "*.csv"))
    dataframes = (pd.read_csv(fname, header=None, names=['date', fname])
                  for fname in allfiles)
    return reduce(partial(pd.merge, on='date', how='outer'), dataframes)
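Putting it together, here is a self-contained sketch of that final function using in-memory buffers in place of files on disk (the names 'csv1'..'csv3' are illustrative stand-ins for your real filenames):

```python
import io
from functools import partial, reduce

import pandas as pd

# Stand-ins for the .csv files on disk, keyed by a made-up filename.
buffers = {
    'csv1': "1/1/2016,1.1\n2/1/2016,1.2\n3/1/2016,1.6",
    'csv2': "1/1/2016,4.6\n2/1/2016,31.2\n3/1/2016,1.8",
    'csv3': "2/1/2016,3.2\n3/1/2016,5.8",
}

# Name each value column after its file so the merged result needs no renaming.
frames = (pd.read_csv(io.StringIO(text), header=None, names=['date', name])
          for name, text in buffers.items())

# Fold the frames together with successive outer merges on the date column.
merged = reduce(partial(pd.merge, on='date', how='outer'), frames)
print(merged)
```

This prints one row per date with a NaN in the csv3 column for 1/1/2016, exactly the shape asked for in the question.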

5 Comments

thanks that's great! Is there a way to add the .csv file names as column headers?
Hmm I think you can modify the column names afterwards at least by assigning to dframe.columns = ['date', 'csv1', 'csv2', 'csv3'] or so, or name your columns when creating the frames: pd.read_csv(csv1, names=['date', 'csv1'], header=None). That way there's no need to suffix common columns and the merged result will be fine as is.
Alternative (imo prettier) syntax to pd.merge(df1,df2,...) is df1.merge(df2, on=0, how='outer').merge(df3, on=0, how='outer') and wow the reduce(partial(... is pretty elegant! :)
@Pocin oh snap, should've known dataframes have merge as a method.
Cheers fellas that hit the spot
