Pandas df.describe doesn't work after adding new column

Question

I've got a Pandas dataframe with 118 columns and I'd like to add a new column 'x119'. I tried several methods, all of which seem to work, such as:

df = df.assign(x119=F)

or:

df.loc[:,'x119'] = F

The methods seem to add the column to the df dataframe but when I use:

df.describe()

I still get 118 columns. Has anyone encountered this situation? The column seems to exist when calling df['x119'], but it is not shown in the output of df.describe().

EDIT: The values of F are categorical with numeric values of 1, 2, 3. The column 'x119' did not exist in df before, and when I use df2=df and then df2.describe() it works fine and I can see all columns.

It is categorical data with numeric labels: 1, 2, 3

– AR_, Commented Sep 6, 2017 at 6:06

Mohamed Ali JAMAOUI · Accepted Answer · 2019-07-13 09:23:55Z

Case 1: all datatypes are numeric:

df.describe() works fine after df.assign(..) for numeric datatypes. Here's a reproducible example:

>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[3,4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> import numpy as np 
>>> df["C"] = np.nan 
>>> df
   A  B   C
0  1  2 NaN
1  3  4 NaN
>>> df.describe()
              A         B    C
count  2.000000  2.000000  0.0
mean   2.000000  3.000000  NaN
std    1.414214  1.414214  NaN
min    1.000000  2.000000  NaN
25%    1.500000  2.500000  NaN
50%    2.000000  3.000000  NaN
75%    2.500000  3.500000  NaN
max    3.000000  4.000000  NaN
>>> df.assign(D=5)
   A  B   C  D
0  1  2 NaN  5
1  3  4 NaN  5
>>> df.describe()
              A         B    C
count  2.000000  2.000000  0.0
mean   2.000000  3.000000  NaN
std    1.414214  1.414214  NaN
min    1.000000  2.000000  NaN
25%    1.500000  2.500000  NaN
50%    2.000000  3.000000  NaN
75%    2.500000  3.500000  NaN
max    3.000000  4.000000  NaN
>>> df  = df.assign(D=5)
>>> df.describe()
              A         B    C    D
count  2.000000  2.000000  0.0  2.0
mean   2.000000  3.000000  NaN  5.0
std    1.414214  1.414214  NaN  0.0
min    1.000000  2.000000  NaN  5.0
25%    1.500000  2.500000  NaN  5.0
50%    2.000000  3.000000  NaN  5.0
75%    2.500000  3.500000  NaN  5.0
max    3.000000  4.000000  NaN  5.0
>>>

df.assign returns a new DataFrame and leaves the original unchanged, so make sure you assign the result back to df, as in df = df.assign(...).
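The difference between the two ways of adding a column can be checked directly. This is a minimal sketch using the same toy frame as above; the variable names (with_d, cols_original and so on) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))

# assign() returns a new DataFrame; the original is left untouched.
with_d = df.assign(D=5)
cols_original = list(df.columns)   # df itself has not gained a column
cols_new = list(with_d.columns)    # the returned frame has

# Item assignment, by contrast, modifies df in place.
df["D"] = 5
cols_after_setitem = list(df.columns)

print(cols_original, cols_new, cols_after_setitem)
```

If describe() is called on a frame that never received the result of assign, the new column cannot show up, whatever its dtype.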

Case 2: mixed numeric and object datatypes:

For mixed object and numeric datatypes, you need to call df.describe(include='all'), as mentioned in the Notes section of the documentation:

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

>>> df["E"] = ['1','2']
>>> df
   A  B   C  D  E
0  1  2 NaN  5  1
1  3  4 NaN  5  2
>>> df.describe()
              A         B    C    D
count  2.000000  2.000000  0.0  2.0
mean   2.000000  3.000000  NaN  5.0
std    1.414214  1.414214  NaN  0.0
min    1.000000  2.000000  NaN  5.0
25%    1.500000  2.500000  NaN  5.0
50%    2.000000  3.000000  NaN  5.0
75%    2.500000  3.500000  NaN  5.0
max    3.000000  4.000000  NaN  5.0
>>> df
   A  B   C  D  E
0  1  2 NaN  5  1
1  3  4 NaN  5  2
>>>

So you need to call describe as follows:

>>> df.describe(include='all')
               A         B    C    D    E
count   2.000000  2.000000  0.0  2.0    2
unique       NaN       NaN  NaN  NaN    2
top          NaN       NaN  NaN  NaN    2
freq         NaN       NaN  NaN  NaN    1
mean    2.000000  3.000000  NaN  5.0  NaN
std     1.414214  1.414214  NaN  0.0  NaN
min     1.000000  2.000000  NaN  5.0  NaN
25%     1.500000  2.500000  NaN  5.0  NaN
50%     2.000000  3.000000  NaN  5.0  NaN
75%     2.500000  3.500000  NaN  5.0  NaN
max     3.000000  4.000000  NaN  5.0  NaN
>>>
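The question describes F as categorical data with numeric labels rather than strings, and the same default applies to that dtype. The sketch below assumes F can be modelled as a pd.Categorical of 1, 2, 3; the small frame and its column names A and B are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]})

# Hypothetical stand-in for F: numeric labels stored with categorical dtype.
F = pd.Categorical([1, 2, 3])
df["x119"] = F

# Default: only numeric columns are summarised, so x119 is left out.
default_cols = list(df.describe().columns)

# include='all' summarises every column.
all_cols = list(df.describe(include="all").columns)

# include=['category'] summarises only the categorical columns.
cat_cols = list(df.describe(include=["category"]).columns)

print(default_cols, all_cols, cat_cols)
```

The column is present in df in all three cases; only the selection made by describe differs.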

Unfortunately it is not empty. If I use df2=df and then call df2.describe(), it works fine.
Thank you! The solution was include='all', which included the categorical numeric data as well!

jezrael · Accepted Answer · 2017-09-06 06:13:21Z

I think the problem may be that the x119 column already existed in df, so the assignment only overwrote its values.

You can check it by:

print (df['x119'])

The simplest way to add a new column is:

print (len(df.columns))
df['x119'] = F
print (len(df.columns))
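The two checks above can be combined into one small test. This is a sketch on a made-up two-column frame with placeholder values for F, not the asker's data.

```python
import pandas as pd

df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6]})
F = [1, 2, 3]  # hypothetical values standing in for the real F

# Was the column already there before the assignment?
existed_before = "x119" in df.columns

n_before = len(df.columns)
df["x119"] = F
n_after = len(df.columns)

# A genuinely new column raises the column count by exactly one.
added_new_column = (not existed_before) and (n_after == n_before + 1)
print(existed_before, n_before, n_after, added_new_column)
```

If the count goes up by one, as it did in the comments below, the column was added and the remaining question is which columns describe chooses to report.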

edited Sep 6, 2017 at 6:13

answered Sep 6, 2017 at 6:04

9 Comments

AR_ Over a year ago

Thank you for your answer, just edited my post to clarify.

jezrael Over a year ago

OK, if you check the number of columns, is it always the same?

AR_ Over a year ago

Tried exactly what you suggested. Got 117 and 118 respectively. This is so weird :/

jezrael Over a year ago

And is print (len(df.columns)) the same before and after?

AR_ Over a year ago

I didn't use include='all' for categorical data. Thank you for your time!
