
I have a huge custom-format text file (too large to load into a single pandas dataframe) which I want to read into a Dask dataframe. I wrote a generator that reads and parses the data in chunks and yields pandas dataframes. I want to load these pandas dataframes into a Dask dataframe and perform operations on the result (things like creating calculated columns, extracting parts of the dataframe, plotting, etc.). I tried using Dask bag but couldn't get it to work, so I decided to write the resulting dataframes into an HDFStore and then use Dask to read from the HDF5 file. This worked well when I ran everything on my own computer. Code below.

cc = read_custom("demo.xyz", chunks=1000) # Generator of pandas dataframes
from pandas import HDFStore
s = HDFStore("demo.h5")
for c in cc:
    s.append("data", c, format='t', append=True)
s.close()

import dask.dataframe as dd
ddf = dd.read_hdf("demo.h5", "data", chunksize=100000)
seqv = (
    (
        (ddf.sxx - ddf.syy) ** 2
        + (ddf.syy - ddf.szz) ** 2
        + (ddf.szz - ddf.sxx) ** 2
        + 6 * (ddf.sxy ** 2 + ddf.syz ** 2 + ddf.sxz ** 2)
    )
    / 2
) ** 0.5
seqv.compute()

Since the last compute was slow, I decided to distribute it over a few systems on my LAN, so I started a scheduler on my machine and a couple of workers on the other systems, and fired up a Client as below.

from dask.distributed import Client
client = Client('mysystemip:8786')  # Connection with the scheduler is established fine

I then read in the Dask dataframe as before. However, I got the error below when I executed seqv.compute().

HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 509, in H5Fopen
    unable to open file
  File "H5Fint.c", line 1400, in H5F__open
    unable to open file
  File "H5Fint.c", line 1615, in H5F_open
    unable to lock the file
  File "H5FD.c", line 1640, in H5FD_lock
    driver lock request failed
  File "H5FDsec2.c", line 941, in H5FD_sec2_lock
    unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'

End of HDF5 error back trace

Unable to open/create file 'demo.h5'

I have made sure that all workers have access to the demo.h5 file. I also tried passing lock=False to read_hdf, but got the same error.

Isn't this possible to do? Maybe I should try another file format? I guess writing each pandas dataframe to a separate file might work, but I'm trying to avoid that (I don't even want the intermediate HDF5 file). Before I go down that route, I'd like to know if there is a better approach to the problem.

Thanks for any suggestions!

1 Answer


If you want to read data from a custom format in a text file I recommend using the dask.bytes.read_bytes function, which returns a sample of your data plus, for each input file, a list of delayed objects, each of which points to a block of bytes from your file. Those blocks will be cleanly separated on a line delimiter by default.

Something like this might work:

import pandas
import dask
import dask.bytes
import dask.dataframe

def parse_bytes(b: bytes) -> pandas.DataFrame:
    ...

# read_bytes returns a sample plus one list of delayed blocks per input file
sample, blocks = dask.bytes.read_bytes("my-file.txt", delimiter=b"\n")
dataframes = [dask.delayed(parse_bytes)(block) for block in blocks[0]]
df = dask.dataframe.from_delayed(dataframes)

2 Comments

Thanks for the suggestion. The file format I'm working with has a few lines of header data, and I search for a particular text block to identify where the actual useful data starts. Can I use this method to read such a file? I see that there is a not_zero parameter in the read_bytes method to discard the header. Can I use that to discard multiple lines? If so, I can do an advance pass on the file and identify how many lines/bytes I need to skip before I get to the actual data.
You can probably use that block of text as a delimiter if you want. You will be guaranteed that Dask always splits on that text. I recommend reading the read_bytes docstring.
