A little background:
I am running a binary stellar evolution code and storing the evolution histories as .dat files. I used to have a smaller dataset of ~2000 .dat files, which I read during post-processing by appending the list data from each file to create a 3d list. Each .dat file looks somewhat like this file.
Recently I started working with a larger dataset, and the number of evolutionary history files rose to ~100000. So I decided to compress each .dat file with gzip and save them all in a zipped folder, because I am doing all this on a remote server and have a limited disk quota.
Main query:
During post-processing, I try to read data using pandas from all these files as 2d numpy arrays which are stacked to form a 3d list (each file has a different length so I could not use numpy.append and have to use lists instead). To achieve this, I use this:
import zipfile

import pandas as pd


def read_evo_history(EvoHist, zipped, z):
    """Read each evolution-history file into a 2d numpy array."""
    ehists = []
    for name in EvoHist:
        if zipped:
            try:
                # Each archive member is itself a gzipped, tab-separated file.
                ehists.append(
                    pd.read_csv(z.open(name), delimiter="\t", compression="gzip", header=None).to_numpy()
                )
            except pd.errors.EmptyDataError:
                # Skip empty files.
                pass
    return ehists


outdir = "plots"
indir = "OutputFiles_allsys"
z = zipfile.ZipFile(indir + ".zip")

# Collect the evolution-history members of the archive.
EvoHist = []
for filename in z.namelist():
    # Directory entries in a zip archive end with "/".
    if not filename.endswith("/"):
        if filename.startswith("OutputFiles_allsys/EvoHist"):
            EvoHist.append(filename)

zipped = True
ehists = read_evo_history(EvoHist, zipped, z)
del z  # Cleanup (if there's no further use of it after this)
The problem I am now facing is that one particular column in the data is being read as strings rather than floats. Do I need to convert the datatype somehow while reading the file? Or is this caused by datatype inconsistencies in the files being read? Is there a way to get the data as a 3d list of numpy arrays of floats?
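To check whether the input files really are inconsistent, one option is to read everything as raw text and list the values that fail numeric conversion. This is a minimal sketch, not tested on the real files: the helper name non_numeric_tokens and the sample text (including the "-nan(ind)" token) are made up for illustration.

```python
import io

import pandas as pd


def non_numeric_tokens(text):
    """Map column index -> sorted raw values that do not convert to a number.

    Hypothetical helper: `text` is the decoded content of one tab-separated file.
    """
    # Read every field as raw text so that pandas does no NaN handling of its own.
    df = pd.read_csv(
        io.StringIO(text), delimiter="\t", header=None, dtype=str, keep_default_na=False
    )
    bad = {}
    for col in df.columns:
        # Anything that is not a valid number (or is NaN-like) ends up as NaN here.
        failed = pd.to_numeric(df[col], errors="coerce").isna()
        if failed.any():
            bad[col] = sorted(set(df.loc[failed, col]))
    return bad


# Made-up sample; "-nan(ind)" stands in for whatever marker the real files contain.
sample = "1.0\t2.0\n3.0\t-nan(ind)\n5.0\t6.0\n"
offenders = non_numeric_tokens(sample)
print(offenders)
```

Running this over the files that misbehave should show exactly which tokens in which column prevent the float conversion.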
P.S.: If this is being caused by inconsistencies in the input files, then I am afraid I won't be able to run my binary stellar evolution code again as it takes days to produce all these files.
I will be more than happy to clarify more on this if needed. Thanks in advance for your time and energy.
Edit:
I noticed that only the 16th column of some files is being read in as strings. I think this is because there are some NaN values in that column, but I may be wrong.
This image shows the raw data with the NaN values pointed out. A demonstration showing that particular column being read as strings can be seen here. However, another column is read as float: image.
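If the NaN-like entries are indeed the cause, a possible workaround is to coerce every column to numeric after reading, so that anything unparseable becomes NaN. This is only a sketch under that assumption: the sample data and the "-nan(ind)" token are invented, and the real files may use a different marker.

```python
import io

import numpy as np
import pandas as pd

# Made-up tab-separated sample with one entry that pandas cannot parse as a float.
sample = "1.0\t2.0\n3.0\t-nan(ind)\n5.0\t6.0\n"

# Same call pattern as in read_evo_history, minus the zip/gzip layers.
raw = pd.read_csv(io.StringIO(sample), delimiter="\t", header=None)
print(raw.dtypes)  # the second column is not a float column

# Coerce column by column; values that are not numbers become NaN.
clean = raw.apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)
print(clean.dtype, clean.shape)
```

Inside read_evo_history, the same apply(pd.to_numeric, errors="coerce") step could be placed between read_csv and to_numpy, at the cost of silently turning any malformed value into NaN.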
NaN values - how are they represented?). Not everyone will want, or be able to, download something from a Google Drive, and questions should be as self-contained as possible according to Stack Overflow rules. It is also extremely unlikely that gzip is relevant; this should be purely about read_csv and to_numpy.