
I have two csv files which have different row numbers.

test1.csv
num,sam
1,1.2
2,1.13
3,0.99

test2.csv
num,sam
1,1.2
2,1.1
3,0.99
4,1.02

I would like to read the sam columns and append them to an empty dataframe. The thing is that when I read test1.csv, I extract the base file name, test1, and want to append the sam column based on the column header in the empty dataframe.

import os
import pandas as pd

big_df = pd.DataFrame(columns=['test1', 'test2'])
pwd = os.getcwd()
for file in os.listdir(pwd):
    filename = os.fsdecode(file)
    if filename.endswith(".csv"):
        prog = filename.split('.')[0]  # test1, test2
        df = pd.read_csv(filename, usecols=['sam'])
        # The read dataframe has one column;
        # move/append that column to big_df where column == prog
        big_df[prog] = df
print(big_df)

But big_df misses the fourth row of test2.csv.

   test1  test2
0   1.20   1.20
1   1.13   1.10
2   0.99   0.99

I expect to see

   test1  test2
0   1.20   1.20
1   1.13   1.10
2   0.99   0.99
3    NaN   1.02

How can I fix that?

  • Read the dataframes, then merge them as you please: Pandas Merging 101 Commented Feb 4, 2022 at 12:03
  • @Mr.T: I understand the concept, but with big_df.merge(df, how='outer') I get the error No common columns to perform merge on. That is because big_df and df have different columns. Commented Feb 4, 2022 at 12:50
  • Can you specify whether the 'num' columns should be taken into account? More specifically, what should happen when one of them is not a continuous sequence starting at 1? Commented Feb 4, 2022 at 15:36
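Following up on the merge error mentioned above: merge joins on shared column names by default, so frames with no column in common raise that error. A minimal sketch of merging on the row index instead (the DataFrames here are stand-ins for the question's two files):

```python
import pandas as pd

# Stand-ins for test1.csv and test2.csv from the question
df1 = pd.DataFrame({'sam': [1.2, 1.13, 0.99]})
df2 = pd.DataFrame({'sam': [1.2, 1.1, 0.99, 1.02]})

big_df = pd.DataFrame()
for prog, df in [('test1', df1), ('test2', df2)]:
    # Merge on the row index, since the frames share no column names;
    # how='outer' keeps test2's extra row and fills test1 with NaN
    big_df = big_df.merge(df['sam'].rename(prog).to_frame(), how='outer',
                          left_index=True, right_index=True)
print(big_df)
```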

3 Answers


Using pandas.concat and a simple dictionary comprehension:

import pandas as pd

files = ['test1.csv', 'test2.csv']
df = pd.concat({f.rsplit('.', 1)[0]: pd.read_csv(f).set_index('num')['sam']
                for f in files}, axis=1)

output:

     test1  test2
num              
1     1.20   1.20
2     1.13   1.10
3     0.99   0.99
4      NaN   1.02

4 Comments

That is possible, but I was trying to read the files in a loop, as the final code deals with a large number of files, so statically specifying the file names is not preferred.
@mahmood you can use other ways to feed in the files: glob, for instance, or a function/generator.
@mozway, wouldn't it be more memory efficient to use a generator instead of a dict, as in (pd.read_csv(f).set_index('num')['sam'].rename(f.split('.')[0]) for f in files)? If there are a large number of files, it seems it could make a difference.
@hpchavaz I haven't tested, but I imagine it would. You can also pass the labels as parameters to concat.
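Picking up both suggestions from the comments, a sketch that combines glob with a generator, so the file list isn't hard-coded and the Series aren't all held in a dict at once (the sample files are recreated here only to make the snippet self-contained; the glob pattern is an assumption):

```python
import glob
import pandas as pd

# Recreate the question's two files so the demo is self-contained
pd.DataFrame({'num': [1, 2, 3], 'sam': [1.2, 1.13, 0.99]}).to_csv('test1.csv', index=False)
pd.DataFrame({'num': [1, 2, 3, 4], 'sam': [1.2, 1.1, 0.99, 1.02]}).to_csv('test2.csv', index=False)

def sam_columns(pattern='test*.csv'):
    # Yield each file's 'sam' column, indexed by 'num' and named after the file
    for f in sorted(glob.glob(pattern)):
        yield pd.read_csv(f).set_index('num')['sam'].rename(f.rsplit('.', 1)[0])

df = pd.concat(sam_columns(), axis=1)
print(df)
```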

You could approach it differently and use concat instead of creating an empty data frame in the first place. It might also be a bit more efficient. In code, that reads like:

import os
import pandas as pd

def get_columns():
    for file in os.listdir(os.getcwd()):
        filename = os.fsdecode(file)
        if filename.endswith(".csv"):
            prog = filename.split('.')[0]  # test1, test2
            yield pd.read_csv(filename, usecols=['sam'])['sam'].rename(prog)

big_df = pd.concat(get_columns(), axis=1)

Otherwise, you could use merge with how='outer', as mentioned in the comments.
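A sketch of that merge route, folding an outer merge on 'num' over the files with functools.reduce (the scratch directory and the read_one helper are illustrative assumptions, not code from the question):

```python
import functools
import os
import tempfile
import pandas as pd

# Recreate the question's files in a scratch directory for a runnable demo
tmp = tempfile.mkdtemp()
pd.DataFrame({'num': [1, 2, 3], 'sam': [1.2, 1.13, 0.99]}).to_csv(
    os.path.join(tmp, 'test1.csv'), index=False)
pd.DataFrame({'num': [1, 2, 3, 4], 'sam': [1.2, 1.1, 0.99, 1.02]}).to_csv(
    os.path.join(tmp, 'test2.csv'), index=False)

def read_one(path):
    # Keep 'num' so the outer merge has a key; rename 'sam' after the file
    prog = os.path.basename(path).rsplit('.', 1)[0]
    return pd.read_csv(path, usecols=['num', 'sam']).rename(columns={'sam': prog})

files = sorted(os.path.join(tmp, f) for f in os.listdir(tmp) if f.endswith('.csv'))
big_df = functools.reduce(lambda a, b: a.merge(b, on='num', how='outer'),
                          map(read_one, files))
print(big_df)
```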

1 Comment

The question is slightly ambiguous. The answer is correct if the 'num' columns have no meaning. On the other hand, if they are to be taken into account, and one of the 'num' columns is not a continuous sequence starting at 1, the result could be different from what is desired.

Read both CSVs first using pd.read_csv("filename", sep=','):

df1
   num   sam
0    1  1.20
1    2  1.13
2    3  0.99

df2
   num   sam
0    1  1.20
1    2  1.10
2    3  0.99
3    4  1.02

Then do the following:

df2.drop('num', axis=1, inplace=True)
df2.columns = ['test2']
df2['test1'] = df1['sam']  # aligns on the index; the missing row becomes NaN

output:

   test2  test1
0   1.20   1.20
1   1.10   1.13
2   0.99   0.99
3   1.02    NaN
