
I have two csv files which have different row numbers.

test1.csv
num,sam
1,1.2
2,1.13
3,0.99

test2.csv
num,sam
1,1.2
2,1.1
3,0.99
4,1.02

I would like to read the sam columns and append them to an empty dataframe. The thing is that when I read test1.csv, I extract the base file name, test1, and want to append the sam column based on the column header in the empty dataframe.

import os
import pandas as pd

big_df = pd.DataFrame(columns=['test1', 'test2'])
pwd = os.getcwd()
for file in os.listdir(pwd):
    filename = os.fsdecode(file)
    if filename.endswith(".csv"):
        prog = filename.split('.')[0]  # test1, test2
        df = pd.read_csv(filename, usecols=['sam'])
        # The read dataframe has one column;
        # move/append that column to big_df where column == prog
        big_df[prog] = df
print(big_df)

But big_df misses the fourth row of test2.csv.

   test1  test2
0   1.20   1.20
1   1.13   1.10
2   0.99   0.99

I expect to see

   test1  test2
0   1.20   1.20
1   1.13   1.10
2   0.99   0.99
3    NaN   1.02

How can I fix that?

  • Read the dataframes, then merge them as you please: Pandas Merging 101 Commented Feb 4, 2022 at 12:03
  • @Mr.T: I understand the concept, but with big_df.merge(df, how='outer') I get the error No common columns to perform merge on. That is because big_df and df have different columns. Commented Feb 4, 2022 at 12:50
  • Can you specify whether the 'num' columns should be taken into account? More specifically, what should happen when one of them is not a continuous sequence starting at 1? Commented Feb 4, 2022 at 15:36
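Following up on the merge error mentioned above: merge joins on shared column names by default, so frames with no column in common raise that error. A minimal sketch of merging on the row index instead (the DataFrames here are stand-ins for the question's two files):

```python
import pandas as pd

# Stand-ins for test1.csv and test2.csv from the question
df1 = pd.DataFrame({'sam': [1.2, 1.13, 0.99]})
df2 = pd.DataFrame({'sam': [1.2, 1.1, 0.99, 1.02]})

big_df = pd.DataFrame()
for prog, df in [('test1', df1), ('test2', df2)]:
    # Merge on the row index, since the frames share no column names;
    # how='outer' keeps test2's extra row and fills test1 with NaN
    big_df = big_df.merge(df['sam'].rename(prog).to_frame(), how='outer',
                          left_index=True, right_index=True)
print(big_df)
```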

3 Answers


Using pandas.concat and a simple dictionary comprehension:

import pandas as pd

files = ['test1.csv', 'test2.csv']
df = pd.concat({f.rsplit('.', 1)[0]: pd.read_csv(f).set_index('num')['sam']
                for f in files}, axis=1)

output:

     test1  test2
num              
1     1.20   1.20
2     1.13   1.10
3     0.99   0.99
4      NaN   1.02

4 Comments

That is possible, but I was trying to read the files in a loop, as the final code deals with a large number of files, so statically specifying the file names is not preferred.
@mahmood you can use other ways to feed in the files: glob, for instance, or a function/generator.
@mozway, wouldn't it be more memory efficient to use a generator instead of a dict, as in (pd.read_csv(f).set_index('num')['sam'].rename(f.split('.')[0]) for f in files)? If there are a large number of files, it seems it could make a difference.
@hpchavaz I haven't tested, but I imagine it would. You can also pass the labels as parameters to concat.
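Picking up both suggestions from the comments, a sketch that combines glob with a generator, so the file list isn't hard-coded and the Series aren't all held in a dict at once (the sample files are recreated here only to make the snippet self-contained; the glob pattern is an assumption):

```python
import glob
import pandas as pd

# Recreate the question's two files so the demo is self-contained
pd.DataFrame({'num': [1, 2, 3], 'sam': [1.2, 1.13, 0.99]}).to_csv('test1.csv', index=False)
pd.DataFrame({'num': [1, 2, 3, 4], 'sam': [1.2, 1.1, 0.99, 1.02]}).to_csv('test2.csv', index=False)

def sam_columns(pattern='test*.csv'):
    # Yield each file's 'sam' column, indexed by 'num' and named after the file
    for f in sorted(glob.glob(pattern)):
        yield pd.read_csv(f).set_index('num')['sam'].rename(f.rsplit('.', 1)[0])

df = pd.concat(sam_columns(), axis=1)
print(df)
```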

You could approach it differently and use concat instead of creating an empty data frame in the first place. It might also be a bit more efficient. In code, that reads like:

import os
import pandas as pd

def get_columns():
    for file in os.listdir(os.getcwd()):
        filename = os.fsdecode(file)
        if filename.endswith(".csv"):
            prog = filename.split('.')[0]  # test1, test2
            yield pd.read_csv(filename, usecols=['sam'])['sam'].rename(prog)

big_df = pd.concat(get_columns(), axis=1)

Otherwise, you could use merge with how='outer', as mentioned in the comments.
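A sketch of that merge route, folding an outer merge on 'num' over the files with functools.reduce (the scratch directory and the read_one helper are illustrative assumptions, not code from the question):

```python
import functools
import os
import tempfile
import pandas as pd

# Recreate the question's files in a scratch directory for a runnable demo
tmp = tempfile.mkdtemp()
pd.DataFrame({'num': [1, 2, 3], 'sam': [1.2, 1.13, 0.99]}).to_csv(
    os.path.join(tmp, 'test1.csv'), index=False)
pd.DataFrame({'num': [1, 2, 3, 4], 'sam': [1.2, 1.1, 0.99, 1.02]}).to_csv(
    os.path.join(tmp, 'test2.csv'), index=False)

def read_one(path):
    # Keep 'num' so the outer merge has a key; rename 'sam' after the file
    prog = os.path.basename(path).rsplit('.', 1)[0]
    return pd.read_csv(path, usecols=['num', 'sam']).rename(columns={'sam': prog})

files = sorted(os.path.join(tmp, f) for f in os.listdir(tmp) if f.endswith('.csv'))
big_df = functools.reduce(lambda a, b: a.merge(b, on='num', how='outer'),
                          map(read_one, files))
print(big_df)
```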

1 Comment

The question is slightly ambiguous. The answer is correct if the 'num' columns have no meaning. On the other hand, if they are to be taken into account, and one of the 'num' columns is not a continuous sequence starting at 1, the result could be different from what is desired.

Read both CSVs first using pd.read_csv("filename", sep=','):

df1
   num   sam
0    1  1.20
1    2  1.13
2    3  0.99

df2
   num   sam
0    1  1.20
1    2  1.10
2    3  0.99
3    4  1.02

Then do the following:

df2.drop('num', axis=1, inplace=True)
df2.columns = ['test2']
df2['test1'] = df1['sam']  # aligns on the index; the missing row becomes NaN

output:

   test2  test1
0   1.20   1.20
1   1.10   1.13
2   0.99   0.99
3   1.02    NaN
