7

I used read_csv() to load a dataset that looks like this:

userid
NaN
1.091178e+11
1.137856e+11

I want to convert the user IDs to strings. One solution is to pass keep_default_na=False to read_csv(), as suggested in this SO question: Converting long integers to strings in pandas (to avoid scientific notation)

Let's say I don't want to use keep_default_na=False. Is there any other way to convert the userid column to str?

I tried df.userid.astype(str) and got 1.091178e+11 back. I was expecting the expanded form, not scientific notation.

What should I do?

2
  • Is it possible to use the parameter dtype={'userid': str}? Does that work for you? Commented Dec 15, 2016 at 6:51
  • You could apply a string format df.userid.apply(lambda x: '{:.0f}'.format(x)). Commented Dec 15, 2016 at 6:56
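The dtype suggestion in the first comment could look roughly like the sketch below. The CSV text is a hypothetical stand-in for the real file, and it assumes the file stores the IDs as plain digits; if the file itself contains values written as 1.091178e+11, reading them as str would preserve that text unchanged.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for the original file. The second column only
# keeps the row with the missing id from being skipped as a blank line.
csv_text = "userid,name\n,a\n109117800000,b\n113785600000,c\n"

# Read userid as text so it is never parsed as a float.
df = pd.read_csv(io.StringIO(csv_text), dtype={"userid": str})
print(df["userid"].tolist())
```

With the default keep_default_na=True, the empty field is still read as a missing value rather than as an empty string.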

2 Answers

7

You can use map or apply, as mentioned in this comment:

print (df.userid.map(lambda x: '{:.0f}'.format(x)))
0             nan
1    109117800000
2    113785600000
Name: userid, dtype: object

df.userid = df.userid.map(lambda x: '{:.0f}'.format(x))
print (df)
         userid
0           nan
1  109117800000
2  113785600000

I wondered whether map would be faster, but it is the same:

#[300000 rows x 1 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
#print (df)

In [40]: %timeit (df.userid.map(lambda x: '{:.0f}'.format(x)))
1 loop, best of 3: 211 ms per loop

In [41]: %timeit (df.userid.apply(lambda x: '{:.0f}'.format(x)))
1 loop, best of 3: 210 ms per loop

Another option is to_string, which renders the whole column as a single string rather than a Series, but it is slow:

print(df.userid.to_string(float_format='{:.0f}'.format))
0            nan
1   109117800000
2   113785600000

In [42]: %timeit (df.userid.to_string(float_format='{:.0f}'.format))
1 loop, best of 3: 2.52 s per loop

1 Comment

You might want to replace the 'nan' strings with pd.NA again, using replace after the map.
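As an alternative to replacing 'nan' afterwards, Series.map accepts na_action='ignore', which skips missing values so they are never formatted in the first place. A minimal sketch, using a small frame rebuilt from the values in the question:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; the real frame comes from read_csv.
df = pd.DataFrame({"userid": [np.nan, 1.091178e+11, 1.137856e+11]})

# na_action="ignore" passes NaN through untouched instead of formatting it,
# so no literal 'nan' string is produced.
formatted = df["userid"].map("{:.0f}".format, na_action="ignore")
print(formatted.tolist())
```

The missing entry stays missing, so isna() and dropna() keep working on the result.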
4

I just stumbled upon this problem after reading a dataframe from a JSON file using the read_json method, which unfortunately does not have a keep_default_na parameter.

The solution was to convert the long floats to np.int64 before converting them to str.

In [53]: tweet_id_sample = tweets.iloc[0]['id']
         tweet_id_sample
Out[53]: 8.924206435553362e+17

In [54]: tweet_id_sample.astype(str)
Out[54]: '8.924206435553362e+17'

In [55]: tweet_id_sample.astype(np.int64).astype(str)
Out[55]: '892420643555336192'

In [56]: # This overflows where int is 32-bit (e.g. NumPy before 2.0 on Windows)
         tweet_id_sample.astype(int)
Out[56]: -2147483648
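For a whole column that also contains NaN, as in the question, a plain astype(np.int64) would fail on the missing value. One possible workaround, sketched here on the question's sample values, is to go through pandas' nullable Int64 dtype first. The second part is a reminder that a float64 cannot hold every integer above 2**53, so an ID that has already been parsed as a float may have lost its last digits before any conversion happens.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
s = pd.Series([np.nan, 1.091178e+11, 1.137856e+11], name="userid")

# Nullable Int64 keeps the missing value; the "string" dtype then renders
# the integers without scientific notation.
as_text = s.astype("Int64").astype("string")
print(as_text.tolist())

# Precision check: 2**53 + 1 does not survive a round trip through float64.
big = 2**53 + 1
print(int(float(big)) == big)
```

If exact IDs matter, reading them as strings at load time avoids the precision issue altogether.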
