
I want to read CSV data from an HDFS server, but it throws an exception like the one below:

    hdfsSeek(desiredPos=64000000): FSDataInputStream#seek error:
    java.io.EOFException: Cannot seek after EOF
    at 
    org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1602)
    at 
    org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)

My Python code:

    from dask import dataframe as dd
    df = dd.read_csv('hdfs://SER/htmpa/a.csv').head(n=3)

csv file:

    user_id,item_id,play_count
    0,0,500
    0,1,3
    0,3,1
    1,0,4
    1,3,1
    2,0,1
    2,1,1
    2,3,5
    3,0,1
    3,3,4
    4,1,1
    4,2,8
    4,3,4

  • What HDFS driver library are you using? I recommend using pyarrow instead of hdfs3; you can do this by specifying driver='pyarrow' in the read_csv call. Commented Jun 25, 2019 at 15:37
  • Specifying driver='pyarrow' does not work; it still throws the seek error. Commented Jun 26, 2019 at 8:07

1 Answer


Are you running within an IDE or a Jupyter notebook?
We are running on a Cloudera distribution and get a similar error. From what we understand, it is not related to Dask but rather to our Hadoop configuration.
In any case, we successfully use the pyarrow library when accessing HDFS. Be aware that if you need to access Parquet files, run with pyarrow version 0.12 rather than 0.13; see the discussion on GitHub.
Update
pyarrow version 0.14 is out and should solve the problem.
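For reference, once HDFS access works, the original call should return the first three rows of the file. Here is a minimal local sketch of that result using only the standard library, with an in-memory buffer standing in for the HDFS file (the cluster path `hdfs://SER/htmpa/a.csv` is obviously not reachable outside that environment):

```python
import csv
import io

# First few rows of a.csv, copied from the question.
data = """user_id,item_id,play_count
0,0,500
0,1,3
0,3,1
1,0,4
1,3,1
"""

reader = csv.DictReader(io.StringIO(data))
# Equivalent of head(n=3): take the first three parsed rows.
head3 = [row for _, row in zip(range(3), reader)]
for row in head3:
    print(row["user_id"], row["item_id"], row["play_count"])
```

This only illustrates the expected shape of the output; it does not exercise the HDFS seek path where the EOFException occurs.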


1 Comment

Thanks, pyarrow works. But a lot of my code relies on Dask. :(
