
A little background:

I am running a binary stellar evolution code and storing the evolution histories as gzipped .dat files. I used to have a smaller dataset of ~2000 .dat files, which I read during post-processing by appending the data from each file to build a 3d list. Each .dat file looks somewhat like this file.

But recently I started working with a larger dataset, and the number of evolutionary history files rose to ~100000. So I decided to gzip-compress the .dat files and store them in a zip archive, since I am doing all this on a remote server and have a limited disk quota.

Main query:

During post-processing, I read each of these files with pandas into a 2d numpy array and stack the arrays to form a 3d list (each file has a different length, so I could not use numpy.append and have to use lists instead). To achieve this, I use:

import os
import zipfile

import pandas as pd

def read_evo_history(EvoHist, zipped, z):
    ehists = []
    for name in EvoHist:
        if zipped:
            try:
                ehists.append(pd.read_csv(z.open(name), delimiter="\t",
                                          compression='gzip',
                                          header=None).to_numpy())
            except pd.errors.EmptyDataError:
                pass  # skip empty history files
    return ehists

outdir = "plots"
indir = "OutputFiles_allsys"


z = zipfile.ZipFile(indir + '.zip')

EvoHist = []
for filename in z.namelist():
    # Skip directory entries. Names inside a zip are not on disk,
    # so os.path.isdir() would not detect them; directory entries
    # in a zip's namelist() end with '/'.
    if not filename.endswith('/'):
        if filename.startswith("OutputFiles_allsys/EvoHist"):
            EvoHist.append(filename)


zipped = True
ehists = read_evo_history(EvoHist, zipped, z)
z.close()                             # Cleanup (if there's no further use of it after this)
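For reference, the zip-of-gzips reading pattern above can be exercised end to end on a tiny in-memory archive (the member names and contents here are made up):

```python
import gzip
import io
import zipfile

import pandas as pd

# Build a small in-memory zip holding two gzipped, tab-separated
# files (hypothetical stand-ins for the EvoHist outputs).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("EvoHist1.dat.gz", gzip.compress(b"1\t2\n3\t4\n"))
    zf.writestr("EvoHist2.dat.gz", gzip.compress(b"5\t6\n"))

# Read each member back the same way as in read_evo_history:
# z.open() yields a file-like object, and compression='gzip'
# tells pandas to decompress it on the fly.
z = zipfile.ZipFile(buf)
ehists = [pd.read_csv(z.open(name), delimiter="\t",
                      compression="gzip", header=None).to_numpy()
          for name in z.namelist()]

print([a.shape for a in ehists])   # [(2, 2), (1, 2)]
```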

The problem I am now facing is that one particular column in the data is being read as a list of strings rather than floats. Do I need to somehow convert the datatype while reading the file? Or is this being caused by datatype inconsistencies in the files being read? Is there a way to get the data as a 3d list of numpy arrays of floats?
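When the offending token cannot be pinned down, one general fallback (not the specific fix that turned out to apply here, just a sketch with a made-up bad value) is to coerce every column to numeric after reading, turning anything unparseable into NaN:

```python
from io import StringIO

import numpy as np
import pandas as pd

# A made-up tab-separated file where one field is not numeric,
# so pandas falls back to object dtype for that column.
raw = "1.0\t2.0\n3.0\tbad\n"
df = pd.read_csv(StringIO(raw), delimiter="\t", header=None)
print(df[1].dtype)           # object

# Coerce all columns to numeric; unparseable entries become NaN.
df = df.apply(pd.to_numeric, errors="coerce")
arr = df.to_numpy()
print(arr.dtype)             # float64
print(np.isnan(arr[1, 1]))   # True
```

The cost of `errors="coerce"` is that genuinely bad data is silently replaced by NaN, so it is worth counting the NaNs afterwards to see how much was dropped.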

P.S.: If this is being caused by inconsistencies in the input files, then I am afraid I won't be able to run my binary stellar evolution code again as it takes days to produce all these files.

I will be more than happy to clarify more on this if needed. Thanks in advance for your time and energy.

Edit:

I noticed that only the 16th column of some files is being read in as a string. And I think this is because there are some NaN values in there, but I may be wrong.

The raw data contains NaN values in that column (pointed out in an attached image); a demonstration shows that particular column being read as strings, while another column is read as floats.

  • It would help your question so much to include a couple of representative lines from your file (especially those that you found to have NaN values - how are they represented?). Not everyone will want, or be able to, download something from a google drive, and questions should be as much self-contained as possible according to Stack Overflow rules. It is also extremely unlikely that gzip is relevant, this should be purely about read_csv and to_numpy. Commented Apr 21, 2020 at 7:39
  • @Amadan Sorry, this is my first time posting a question. Thanks for your guidance. I added a few things. I was planning to remove the irrelevant gzip info too. But I stumbled upon what needs to be done to overcome what I was facing. Do I delete my question now, or should I add a solution to it? Commented Apr 21, 2020 at 8:27
  • Deleting questions can eventually lead to being banned from asking questions (well, I believe no this one, as it is not downvoted, but better safe than sorry :) ). So I definitely suggest providing the answer. It might even net you some reputation. (And no worries, everyone was new once.) Commented Apr 21, 2020 at 8:28
  • @Amadan Thanks, will add the solution soon :) Commented Apr 21, 2020 at 8:35
  • @Amadan No worries at all. It's posted now. Thanks for all your help, you initiated my StackOverflow journey :) Commented Apr 21, 2020 at 12:14

1 Answer


The workaround for a missing value was simple: pandas.read_csv has a parameter called na_values, which lets users pass specific values to be read as NaNs. From the pandas docs:

na_values: scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default, the following values are interpreted as NaN: ‘’, ... ‘NA’, ...

Pandas itself is smart enough to recognize those values automatically, without us explicitly stating them. But in my case the file had NaN values written as 'nan ' (yeah, with a trailing space!), which is why I was facing this issue. A minute change in the code fixed this:

pd.read_csv(z.open(EvoHist[i]), delimiter="\t",
            compression='gzip', header=None,
            na_values='nan ').to_numpy()
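A self-contained demonstration (with a made-up two-column file) of why the trailing space matters: without na_values, the 'nan ' token is kept as a string and the whole column comes back as object dtype, while declaring it as an NA value restores floats:

```python
from io import StringIO

import pandas as pd

raw = "1.0\t2.5\n2.0\tnan \n"   # note the trailing space after 'nan'

# 'nan ' (with the space) is not among pandas' default NA tokens,
# so the second column is read as strings.
bad = pd.read_csv(StringIO(raw), delimiter="\t", header=None)
print(bad[1].dtype)    # object

# Declaring 'nan ' as an NA value lets the column parse as float.
good = pd.read_csv(StringIO(raw), delimiter="\t", header=None,
                   na_values="nan ")
print(good[1].dtype)   # float64
```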

