Pandas Dataframe interpreting columns as float instead of String

Question

I want to import a csv file into a pandas dataframe. There is a column with IDs, which consist of only numbers, but not every row has an ID.

   ID      xyz
0  12345     4.56
1           45.60
2  54231   987.00

I want to read this column as String, but even if I specify it with

df=pd.read_csv(filename,dtype={'ID': str})

I get

   ID         xyz
0  '12345.0'    4.56
1   NaN        45.60
2  '54231.0'  987.00

Is there an easy way to get the ID as a string without the decimal, like '12345', without having to edit the strings after importing the table?

If your concern is output format, then fix this when you export the data (e.g. to_csv, to_string), not by changing your underlying data (which looks fine) to awkward types.
– jpp, Commented Nov 13, 2018 at 13:18
I think that if you upgrade your pandas version, everything will work fine.
– jezrael, Commented Nov 13, 2018 at 13:18
My underlying data is a csv file with an ID that is not meant to be treated as numeric but, as the name suggests, as an identification. String seems to be the best representation for that.
– Georg B, Commented Nov 13, 2018 at 13:24

Joe · Accepted Answer · 2018-11-13 12:46:59Z

9

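One solution is to convert the column after you have imported the df. If the column has no missing values:

import pandas as pd

df = pd.read_csv(filename)
df['ID'] = df['ID'].astype(int).astype(str)

Or, since there are NaN values (which astype(int) cannot convert), leave them untouched and convert only the rest:

df['ID'] = df['ID'].apply(lambda x: x if pd.isnull(x) else str(int(x)))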
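As a small self-contained check of the NaN-safe conversion above, the following sketch reads a hypothetical in-memory copy of the question's data (the `csv_text` string and the comma separator are assumptions, not the asker's real file) and converts only the non-missing IDs:

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file from the question.
csv_text = "ID,xyz\n12345,4.56\n,45.60\n54231,987.00\n"

df = pd.read_csv(io.StringIO(csv_text))
# Without a dtype hint, the empty cell makes pandas parse ID as float64.

# Convert the non-missing values to int, then to str; keep NaN as NaN.
df['ID'] = df['ID'].apply(lambda x: x if pd.isnull(x) else str(int(x)))

ids = df['ID'].tolist()
print(ids)
```

The missing entry stays a NaN rather than becoming a string, so it can still be filtered with `df['ID'].notna()`.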

edited Nov 13, 2018 at 12:46

answered Nov 13, 2018 at 12:23

Joe


4 Comments

Georg B Over a year ago

Doesn't work, because I have empty cells, and NaN values can't be converted to int

Georg B Over a year ago

That worked, thank you. I was trying something similar, but yours works way better.

Nazanin Zinouri Over a year ago

This saved me after half an hour of looking through other answers that did not work. Thank you!

rahnama7m Over a year ago

What should we do if we have a cell with the value "00212"? @joe

jezrael · Answer · 2018-11-13 13:08:28Z

1

A possible solution, if there are no missing values in the numeric columns: add the parameter keep_default_na=False so that empty values are kept as empty strings instead of being converted to NaN. Note that this turns off NaN conversion for all of the data, not only for the ID column; see also the docs:

import pandas as pd
from io import StringIO  # pd.compat.StringIO is not available in newer pandas versions

temp=u"""ID;xyz
0;12345;4.56
1;;45.60
2;54231;987.00"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", dtype={'ID': str}, keep_default_na=False)
print (df)
      ID     xyz
0  12345    4.56
1          45.60
2  54231  987.00
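Since the asker only needs an easy way to filter out the empty IDs, here is a minimal sketch of how that could look with keep_default_na=False. The `csv_text` data is a hypothetical stand-in, and the filter on the empty string is one possible approach rather than part of the original answer:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated sample mirroring the question's data.
csv_text = "ID;xyz\n12345;4.56\n;45.60\n54231;987.00\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    dtype={'ID': str},
    keep_default_na=False,  # empty cells stay '' instead of becoming NaN
)

ids = df['ID'].tolist()
print(ids)  # ['12345', '', '54231']

# Rows with a real ID are those whose ID is not the empty string.
non_empty = df[df['ID'] != '']
print(non_empty)
```

Because the xyz column has no empty cells here, it is still parsed as a float column.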

EDIT:

In pandas 0.23.4 your solution works perfectly for me, which suggests a bug in older pandas versions:

import pandas as pd
from io import StringIO  # pd.compat.StringIO is not available in newer pandas versions

temp=u"""ID;xyz
0;12345;4.56
1;;45.60
2;54231;987.00"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", dtype={'ID': str})
print (df)
      ID     xyz
0  12345    4.56
1    NaN   45.60
2  54231  987.00
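A related case raised in the comments is an ID with leading zeros such as "00212". The sketch below (with hypothetical sample data, assuming a pandas version where dtype={'ID': str} behaves as in the EDIT above) compares the default parse with the string parse:

```python
import io
import pandas as pd

# Hypothetical sample with a leading-zero ID and one empty cell.
csv_text = "ID,xyz\n00212,4.56\n,45.60\n54231,987.00\n"

# Default parsing: the column becomes float64 and the zeros are lost.
as_default = pd.read_csv(io.StringIO(csv_text))
default_first = as_default['ID'].iloc[0]
print(default_first)  # 212.0

# Reading as str keeps the text exactly as written in the file.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'ID': str})
str_first = as_str['ID'].iloc[0]
print(str_first)  # 00212
```

Converting after import cannot recover the zeros, because they are already gone once the column has been parsed as a number.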

edited Nov 13, 2018 at 13:08

answered Nov 13, 2018 at 12:30

jezrael


7 Comments

Georg B Over a year ago

It works for your example, but not for my csv file. The only difference from the previous result is that NaN became an empty string. I am really confused; I checked my file again, but there are definitely no floats there.

jezrael Over a year ago

@GeorgB - what is the expected output in the ID column instead of an empty string?

Georg B Over a year ago

The empty cells don't matter, as long as I have an easy way to filter them out. I only need the non-empty IDs as String without a ".0" at the end. The user Joe gave an answer that worked, so I can continue. I just have the feeling there is a way to do it while reading in the file and not afterwards.

jezrael Over a year ago

@GeorgB - df['ID'] = df['ID'].apply(lambda x: x if pd.isnull(x) else str(int(x))) is your solution?

yeamusic21 Over a year ago

Thanks for this solution! Your EDIT: section with dtype={'ID': str} solved the issue for me! I was losing leading zeros that I wanted to keep, so I needed to read it with the correct schema. Great suggestion!


jpp · Answer · 2018-11-13 13:06:32Z

0

Specify float format when writing to csv

Since your underlying problem is output format when exporting data, no manipulation is required. Just use:

df.to_csv('file.csv', float_format='%.0f')

If you want only specific columns to have this formatting, you can use to_string:

def format_int(x):
    # NaN is never equal to itself, so x == x is False only for missing values
    return f'{x:.0f}' if x == x else ''

with open('file.csv', 'w') as fout:
    fout.write(df.to_string(formatters={'ID': format_int}))
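To make the difference between the two options above concrete, here is a hedged sketch. It shows that float_format applies to every float column, and, as one possible alternative that is not from the original answer, formats only the ID column on a copy before calling to_csv. The sample DataFrame is hypothetical:

```python
import pandas as pd

# Hypothetical frame where ID was already read as float64.
df = pd.DataFrame({
    'ID': [12345.0, float('nan'), 54231.0],
    'xyz': [4.56, 45.60, 987.00],
})

def format_int(x):
    # NaN != NaN, so missing values become an empty string.
    return f'{x:.0f}' if x == x else ''

# float_format is applied to all float columns, so xyz is rounded too.
all_cols = df.to_csv(index=False, float_format='%.0f').splitlines()
print(all_cols)

# Alternative: format only ID on a copy; the original frame stays numeric.
out = df.copy()
out['ID'] = out['ID'].map(format_int)
id_only = out.to_csv(index=False).splitlines()
print(id_only)
```

Working on a copy keeps the in-memory data numeric, in line with the advice to change only the exported representation.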

Keep numeric data numeric

"There is a column with IDs, which consist of only numbers"

If your column only includes numbers, don't convert to strings! Your desire to convert to strings seems to be an XY problem. Numeric identifiers should stay numeric.

Float `NaN` prompts upcasting

Your issue is that NaN values can't coexist with integers in an integer series. Since NaN is a float, Pandas upcasts the whole column to float. This is natural, because the alternative, object dtype, is inefficient and not recommended.

If viable, you can use a sentinel value, e.g. -1 to indicate nulls:

df['ID'] = pd.to_numeric(df['ID'], errors='coerce').fillna(-1).astype(int)

print(df)

      ID     xyz
0  12345    4.56
1     -1   45.60
2  54231  987.00
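The upcasting described above can be observed directly. The sketch below also shows the nullable 'Int64' extension dtype as another option besides a sentinel. This is an assumption beyond the original answer: it requires a pandas release newer than the 0.23.4 mentioned in this thread, and the sample values are hypothetical:

```python
import pandas as pd

# A missing value next to integers forces the default dtype to float64.
ids = pd.Series([12345, None, 54231])
float_dtype = str(ids.dtype)
print(float_dtype)  # float64

# The nullable integer dtype keeps whole numbers and a real missing marker.
nullable = ids.astype('Int64')
print(nullable)
```

With this dtype the IDs stay integers, and no -1 placeholder has to be removed before saving.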

edited Nov 13, 2018 at 13:06

answered Nov 13, 2018 at 12:28

jpp


5 Comments

jezrael Over a year ago

"If your column only includes numbers, don't convert to strings!" - if the OP needs to convert numeric values to strings, why not? What is wrong with it?

jpp Over a year ago

@jezrael, XY problem: "The XY problem is asking about your attempted solution rather than your actual problem."

jezrael Over a year ago

OK, please add your comment about the XY problem under the question, but if someone needs to convert a numeric column to strings, it is absolutely not wrong.

Georg B Over a year ago

I need them as strings, or at least integers that can be converted to strings. I will try your method if I don't find another option, but I'll have to remove the -1 every time I save the file.

jezrael Over a year ago

I downvoted because "don't convert to strings!" is a wrong statement.
