Unable to read text from csv into dataframe when text is "null" and "nan"

Question

I am trying to upload Google n-gram word frequency data into a dataframe.

Dataset can be found here: https://www.kaggle.com/wheelercode/dictionary-word-frequency

A couple of words are not loading unfortunately. The word "null" appears on row 9156 of the csv file and the word "nan" appears on row 17230 of the csv file.

This is how I am uploading the data

my_freq_df = pd.read_csv('ngram_freq_dict.csv',  dtype = {"word": str, "count": np.int32} ) 
my_freq_df['word'] = my_freq_df['word'].astype("string")

Unfortunately, when I try to check if those words were loaded as strings, I get that they weren't

count = 0

for index, row in my_freq_df.iterrows():
    count += 1
    try:
        len(row['word'])
    except:
        print(row['word'])
        print(count)
        print("****____*****")

We can see the image of the output of the try, except and we can see that I cant calculate the length of the words "nan" and "null". Both words are being read as NA.

How do I fix this?

ncoish · Accepted Answer · 2022-02-20 04:53:08Z

2

Pandas treats a certain set of values as "NA" by default, but you can explicitly tell it to ignore those defaults with keep_default_na=False. "null" and "nan" both happen to be in that list!

my_freq_df = pd.read_csv(
    'ngram_freq_dict.csv',
    dtype = {"word": str, "count": np.int32},
    keep_default_na=False
)

As of today, the complete set of strings that it treats as NA by default is:

[
 "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", 
 "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL",
 "NaN", "n/a", "nan", "null"
]

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

edited Feb 20, 2022 at 4:53

answered Feb 20, 2022 at 4:47

ncoish

3432 silver badges14 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Amatya Over a year ago

is there some way to customize the list of stings that is treated as NA?

ncoish Over a year ago

Yep! Use the na_values parameter. In the docs it's right next to keep_default_na. You'll need to use both if you want to exclude the defaults and define your own.

Collectives™ on Stack Overflow

Unable to read text from csv into dataframe when text is "null" and "nan"

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related