13

I have a list of dataframes

dfA:

item   a     A              
A      1     2 
B      1     3         
C      0     4         

dfB:

item   a     B
E      1     2
F      0     6

dfC:

item   a     C
G      1     3
H      0     4

I want to add the missing columns to each dataframe

This is what I want : dfA:

item   a     A    B    C           
A      1     2    0    0
B      1     3    0    0 
C      0     4    0    0 

dfB:

item   a     A   B    C
E      1     0   2    0
F      0     0   6    0

dfC:

item   a     A   B   C
G      1     0   0   3
H      0     0   0   4
1

5 Answers 5

15

1) Take the union of each dataframe's columns.

col_list = list(set().union(dfA.columns, dfB.columns, dfC.columns))
col_list.sort()
['A', 'B', 'C', 'a']

2) Use the reindex function.

dfA2 = dfA.reindex(columns=col_list, fill_value=0)
   A  B  C  a
A  2  0  0  1
B  3  0  0  1
C  4  0  0  0

dfB2 = dfB.reindex(columns=col_list, fill_value=0)
   A  B  C  a
E  0  2  0  1
F  0  6  0  0

dfC2 = dfC.reindex(columns=col_list, fill_value=0)
   A  B  C  a
G  0  0  3  1
H  0  0  4  0

3) You can use reindex to drop, add, or duplicate columns.

dfA3 = dfA.reindex(columns=['C', 'A', 'A', 'D'], fill_value=0)
   C  A  A  D
A  0  2  2  0
B  0  3  3  0
C  0  4  4  0
Sign up to request clarification or add additional context in comments.

Comments

5

You can create a combined column list like this

col_list = (df1.append([df2,df3])).columns.tolist()

Now add the columns to each dataframe

df1 = df1.loc[:, col_list].fillna(0)
print(df1)

    A   B   C   a   item
0   2   0.0 0.0 1   A
1   3   0.0 0.0 1   B
2   4   0.0 0.0 0   C


df2 = df2.loc[:, col_list].fillna(0)
print(df2)

    A   B   C   a   item
0   0.0 2   0.0 1   E
1   0.0 6   0.0 0   F

df3 = df3.loc[:, col_list].fillna(0)
print(df3)

    A   B   C   a   item
0   0.0 0.0 3   1   G
1   0.0 0.0 4   0   H

3 Comments

I think this behavior is deprecated (raises KeyError) as of pandas 0.21. Use df.reindex() instead. pandas.pydata.org/pandas-docs/stable/user_guide/…
Also, df.append is also depreciated.
This no longer works. See this answer, using .reindex, instead.
3

Option 1
Align both axes
With functools.partial

from functool import partial

(_, dfA), (dfC, dfB) = list(map(
    partial(dfC.align, fill_value=0),
    dfA.align(dfB, fill_value=0)
))

Option 1B
Align columns only

from functools import partial

(_, dfA), (dfC, dfB) = list(map(
    partial(dfC.align, fill_value=0, axis=1),
    dfA.align(dfB, fill_value=0, axis=1)
))

Option 2
Align both axes
With pd.DataFrame.reindex

from functools import reduce    

lod = [dfA, dfB, dfC]
idx = reduce(pd.Index.union, (d.index for d in lod))
col = reduce(pd.Index.union, (d.columns for d in lod))
dfA, dfB, dfC = (d.reindex(idx, col, fill_value=0) for d in lod)

Option 2B
Align columns only

lod = [dfA, dfB, dfC]
col = reduce(pd.Index.union, (d.columns for d in lod))
dfA, dfB, dfC = (d.reindex(columns=col, fill_value=0) for d in lod)

Setup

dfA = pd.DataFrame(**{
    'columns': ['item', 'a', 'A'],
    'data': [['A', 1, 2], ['B', 1, 3], ['C', 0, 4]],
    'index': [0, 1, 2]})

dfB = pd.DataFrame(**{
    'columns': ['item', 'a', 'B'],
    'data': [['E', 1, 2], ['F', 0, 6]],
    'index': [0, 1]})

dfC = pd.DataFrame(**{
    'columns': ['item', 'a', 'C'],
    'data': [['G', 1, 3], ['H', 0, 4]],
    'index': [0, 1]})

Comments

2

One way using merge by defining the order of dfA, dfB, dfC in reduce operation.

In [1932]: reduce(lambda l,r: pd.merge(l,r,on=['item', 'a'], how='left'),
                              [dfA, dfB, dfC]).fillna(0)
Out[1932]:
  item  a  A    B    C
0    A  1  2  0.0  0.0
1    B  1  3  0.0  0.0
2    C  0  4  0.0  0.0

In [1933]: reduce(lambda l,r: pd.merge(l,r,on=['item', 'a'], how='left'), 
                  [dfB, dfA, dfC]).fillna(0)
Out[1933]:
  item  a  B    A    C
0    E  1  2  0.0  0.0
1    F  0  6  0.0  0.0

In [1934]: reduce(lambda l,r: pd.merge(l,r,on=['item', 'a'], how='left'),
                  [dfC, dfA, dfB]).fillna(0)
Out[1934]:
  item  a  C    A    B
0    G  1  3  0.0  0.0
1    H  0  4  0.0  0.0

Comments

0

You can use columns union and difference:

dfA.loc[:, dfB.columns.union(dfC.columns).difference(dfA.columns)] = 0
print(dfA)

Output:

      a  A  B  C
item            
A     1  2  0  0
B     1  3  0  0
C     0  4  0  0

You can also use a for loop:

dfs = [dfA, dfB, dfC]
all_columns = set().union(*[df.columns for df in dfs])
for df in dfs:
    df.loc[:, all_columns.difference(df.columns)] = 0
    print(df, end='\n\n')

Output:

      a  A  B  C
item            
A     1  2  0  0
B     1  3  0  0
C     0  4  0  0

      a  B  A  C
item            
E     1  2  0  0
F     0  6  0  0

      a  C  A  B
item            
G     1  3  0  0
H     0  4  0  0

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.