
A little background:

I am running a binary stellar evolution code and storing the evolution histories as gzipped .dat files. I used to have a smaller dataset of ~2000 .dat files, which I read during post-processing by appending the data from each file to build a 3d list. Each .dat file looks somewhat like this file.

But recently I started working with a larger dataset, and the number of evolutionary history files rose to ~100000. So I decided to gzip-compress the .dat files and store them in a zip archive, since I am doing all this on a remote server and have a limited disk quota.

Main query:

During post-processing, I read each of these files with pandas into a 2d numpy array and stack the arrays to form a 3d list (each file has a different length, so I could not use numpy.append and have to use lists instead). To achieve this, I use:

import os
import zipfile

import pandas as pd

def read_evo_history(EvoHist, zipped, z):
    ehists = []
    for name in EvoHist:
        if zipped:
            try:
                ehists.append(pd.read_csv(z.open(name), delimiter="\t",
                                          compression='gzip',
                                          header=None).to_numpy())
            except pd.errors.EmptyDataError:
                pass  # skip empty history files
    return ehists

outdir = "plots"
indir = "OutputFiles_allsys"


z = zipfile.ZipFile(indir + '.zip')

EvoHist = []
for filename in z.namelist():
    # Skip directory entries. Names inside a zip are not on disk,
    # so os.path.isdir() would not detect them; directory entries
    # in a zip's namelist() end with '/'.
    if not filename.endswith('/'):
        if filename.startswith("OutputFiles_allsys/EvoHist"):
            EvoHist.append(filename)


zipped = True
ehists = read_evo_history(EvoHist, zipped, z)
z.close()                             # Cleanup (if there's no further use of it after this)
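For reference, the zip-of-gzips reading pattern above can be exercised end to end on a tiny in-memory archive (the member names and contents here are made up):

```python
import gzip
import io
import zipfile

import pandas as pd

# Build a small in-memory zip holding two gzipped, tab-separated
# files (hypothetical stand-ins for the EvoHist outputs).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("EvoHist1.dat.gz", gzip.compress(b"1\t2\n3\t4\n"))
    zf.writestr("EvoHist2.dat.gz", gzip.compress(b"5\t6\n"))

# Read each member back the same way as in read_evo_history:
# z.open() yields a file-like object, and compression='gzip'
# tells pandas to decompress it on the fly.
z = zipfile.ZipFile(buf)
ehists = [pd.read_csv(z.open(name), delimiter="\t",
                      compression="gzip", header=None).to_numpy()
          for name in z.namelist()]

print([a.shape for a in ehists])   # [(2, 2), (1, 2)]
```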

The problem I am now facing is that one particular column in the data is being read as a list of strings rather than floats. Do I need to somehow convert the datatype while reading the file? Or is this being caused by datatype inconsistencies in the files being read? Is there a way to get the data as a 3d list of numpy arrays of floats?
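When the offending token cannot be pinned down, one general fallback (not the specific fix that turned out to apply here, just a sketch with a made-up bad value) is to coerce every column to numeric after reading, turning anything unparseable into NaN:

```python
from io import StringIO

import numpy as np
import pandas as pd

# A made-up tab-separated file where one field is not numeric,
# so pandas falls back to object dtype for that column.
raw = "1.0\t2.0\n3.0\tbad\n"
df = pd.read_csv(StringIO(raw), delimiter="\t", header=None)
print(df[1].dtype)           # object

# Coerce all columns to numeric; unparseable entries become NaN.
df = df.apply(pd.to_numeric, errors="coerce")
arr = df.to_numpy()
print(arr.dtype)             # float64
print(np.isnan(arr[1, 1]))   # True
```

The cost of `errors="coerce"` is that genuinely bad data is silently replaced by NaN, so it is worth counting the NaNs afterwards to see how much was dropped.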

P.S.: If this is being caused by inconsistencies in the input files, then I am afraid I won't be able to run my binary stellar evolution code again as it takes days to produce all these files.

I will be more than happy to clarify more on this if needed. Thanks in advance for your time and energy.

Edit:

I noticed that only the 16th column of some files is being read in as a string. And I think this is because there are some NaN values in there, but I may be wrong.

The raw data contains NaN values in that column (pointed out in an attached image); a demonstration shows that particular column being read as strings, while another column is read as floats.

  • It would help your question so much to include a couple of representative lines from your file (especially those that you found to have NaN values - how are they represented?). Not everyone will want, or be able to, download something from a google drive, and questions should be as much self-contained as possible according to Stack Overflow rules. It is also extremely unlikely that gzip is relevant, this should be purely about read_csv and to_numpy. Commented Apr 21, 2020 at 7:39
  • @Amadan Sorry, this is my first time posting a question. Thanks for your guidance. I added a few things. I was planning to remove the irrelevant gzip info too. But I stumbled upon what needs to be done to overcome what I was facing. Do I delete my question now, or should I add a solution to it? Commented Apr 21, 2020 at 8:27
  • Deleting questions can eventually lead to being banned from asking questions (well, I believe no this one, as it is not downvoted, but better safe than sorry :) ). So I definitely suggest providing the answer. It might even net you some reputation. (And no worries, everyone was new once.) Commented Apr 21, 2020 at 8:28
  • @Amadan Thanks, will add the solution soon :) Commented Apr 21, 2020 at 8:35
  • @Amadan No worries at all. It's posted now. Thanks for all your help, you initiated my StackOverflow journey :) Commented Apr 21, 2020 at 12:14

1 Answer


The workaround for a missing value was simple: pandas.read_csv has a parameter called na_values, which lets users pass specific values to be read as NaNs. From the pandas docs:

na_values: scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default, the following values are interpreted as NaN: ‘’, ... ‘NA’, ...

Pandas itself is smart enough to recognize those values automatically, without us explicitly stating them. But in my case the file had NaN values written as 'nan ' (yeah, with a trailing space!), which is why I was facing this issue. A minute change in the code fixed this:

pd.read_csv(z.open(EvoHist[i]), delimiter="\t",
            compression='gzip', header=None,
            na_values='nan ').to_numpy()
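A self-contained demonstration (with a made-up two-column file) of why the trailing space matters: without na_values, the 'nan ' token is kept as a string and the whole column comes back as object dtype, while declaring it as an NA value restores floats:

```python
from io import StringIO

import pandas as pd

raw = "1.0\t2.5\n2.0\tnan \n"   # note the trailing space after 'nan'

# 'nan ' (with the space) is not among pandas' default NA tokens,
# so the second column is read as strings.
bad = pd.read_csv(StringIO(raw), delimiter="\t", header=None)
print(bad[1].dtype)    # object

# Declaring 'nan ' as an NA value lets the column parse as float.
good = pd.read_csv(StringIO(raw), delimiter="\t", header=None,
                   na_values="nan ")
print(good[1].dtype)   # float64
```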

