
I'm trying to read a CSV file that contains the literal value 'NA' as data. When I use the keep_default_na=False option to keep the NA values, it affects other columns that have empty cells.

The data:

 colA  colB  colC             
'abc' ,    ,  NA
'ljk' , 10 ,  'Paris' 
'xyz' , 25 ,  NA

Here, I want to keep the NA values in column 'colC'. I'm reading the CSV like this:

DF = pandas.read_csv(csv, keep_default_na=False)

Now the NA values are present in DF, but the values in the second column 'colB' are read as strings ('10', '25'), not as numbers.

This happens whenever a column with numeric values contains an empty cell.

How can I apply keep_default_na=False and still read the other columns with the correct dtype?

  • Do you want them as float or object? Your title is confusing. Commented May 3, 2024 at 9:30
  • I want to read them as floats or numbers, not as objects. Commented May 3, 2024 at 9:54
  • Then why is your title about conversion to object? Commented May 3, 2024 at 9:56
  • Changed the title. Commented May 3, 2024 at 10:19
  • Then why do you use keep_default_na=False? Commented May 3, 2024 at 10:30

1 Answer


Just do:

import pandas as pd

df = pd.read_csv(csv, keep_default_na=False)
df['colB'] = pd.to_numeric(df['colB'], errors='coerce')

errors='coerce' ensures that the 'NA' string values are converted to NaN for numeric processing. If you'd rather have 0 or some other value instead of NaN, you can fill them in afterwards with fillna().
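For instance, a short sketch of that fillna() step (the DataFrame is built by hand here to mirror the sample data, rather than read from your CSV):

```python
import pandas as pd

# Hand-built stand-in for the question's CSV, as read with keep_default_na=False:
# the empty cell survives as an empty string, and 'NA' survives as a string.
df = pd.DataFrame({
    "colA": ["abc", "ljk", "xyz"],
    "colB": ["", "10", "25"],
    "colC": ["NA", "Paris", "NA"],
})

# Coerce the numeric column; the unparseable empty string becomes NaN
df["colB"] = pd.to_numeric(df["colB"], errors="coerce")

# Optionally replace the NaN with a sentinel value such as 0
df["colB"] = df["colB"].fillna(0)
```

After this, colB is a float column holding 0.0, 10.0, 25.0, while the 'NA' strings in colC are untouched.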

Here is the output with your data:

  colA  colB    colC
0  abc    NaN     NA
1  ljk   10.0  Paris
2  xyz   25.0     NA

Edit: automatic detection

Just apply the to_numeric() approach to each column. If any non-NaN value remains after conversion, the column is numeric and we replace it with the converted one.

for col in df.columns:
    temp_col = pd.to_numeric(df[col], errors='coerce')

    # If the whole column coerced to NaN it is non-numeric; leave it alone
    if temp_col.notna().any():
        df[col] = temp_col
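For example, running that loop over the sample data (again built by hand here for illustration) converts only colB and leaves the text columns alone:

```python
import pandas as pd

# Stand-in for the question's CSV as read with keep_default_na=False
df = pd.DataFrame({
    "colA": ["abc", "ljk", "xyz"],
    "colB": ["", "10", "25"],
    "colC": ["NA", "Paris", "NA"],
})

for col in df.columns:
    temp_col = pd.to_numeric(df[col], errors="coerce")

    # colA and colC coerce entirely to NaN, so they are skipped;
    # colB keeps its parsed numbers and is adopted as float
    if temp_col.notna().any():
        df[col] = temp_col

print(df.dtypes)  # colB becomes float64; colA and colC stay object
```

One caveat with this heuristic: a mostly-text column that happens to contain a single numeric entry would also be converted, turning its text values into NaN.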

5 Comments

Hi, what if it's a large CSV file with multiple numeric columns, each with a few empty cells? It's impractical to apply errors='coerce' to every affected column by hand.
You can define numeric_columns = ['colB', 'colD', 'colG'], then loop over numeric_columns and use df[col] = pd.to_numeric(df[col], errors='coerce').
I'm looking for a general solution that works without knowing the column names. I can't create a new list for each CSV file.
@Suri_362 But you do know that colC should be excluded, by name?
I updated my answer with automatic detection.
