I have a Python script that needs to process a large file. The code works fine if I reduce the original file and run the script, but when I run it on the original data it takes forever to execute. I am considering storing the file in HDFS and reading it from the Python script. But in order to use HDFS, do I have to convert my Python script into a MapReduce program, or can I use the same code?

2 Comments
  • how big is your file? Commented Jul 20, 2015 at 19:11
  • It has 160,057 rows and 100 columns. Commented Jul 20, 2015 at 19:14

1 Answer

You'll likely need to tweak your Python code and then use Hadoop Streaming to process it. This is exactly the type of situation for which streaming was intended.
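As a minimal sketch of the contract Hadoop Streaming expects: the mapper reads raw records and emits tab-separated key/value pairs, and the reducer receives those pairs grouped by key. The word counting below is just a placeholder for whatever per-row work your script actually does; in a real job the mapper would consume `sys.stdin` and print to `sys.stdout`.

```python
from itertools import groupby

def mapper(lines):
    # Map step: emit a (key, value) pair for each word on each input line.
    # In a streaming job these lines come from sys.stdin, and each pair is
    # written to sys.stdout as "key\tvalue".
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Reduce step: pairs arrive grouped by key, because Hadoop sorts the
    # mapper output between the two phases.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

# Local dry run -- we sort manually to stand in for Hadoop's shuffle/sort:
sample = ["to be or not to be"]
counts = dict(reducer(sorted(mapper(sample))))
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

To run this on a cluster you would split the mapper and reducer into two small scripts (or dispatch on a command-line argument) and point Hadoop Streaming at them.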


5 Comments

Thank you! I was struggling with a similar problem (for files with 100k+ lines).
Could you please elaborate a bit more? Do I need to tweak the code to convert it into MapReduce?
Have you read through the example I provided? It's pretty much step by step for Python code, so adjust your script to follow its logic regarding reading input from stdin and writing to stdout, as expected by Hadoop Streaming.
Thank you for the input. I am actually trying to run a scikit-learn extra-trees classifier on a dataset with 160,057 rows and 100 columns. I am unable to run this on my local machine and hence am trying to use Hadoop Streaming, as you mentioned. But does Hadoop help in any way with machine learning problems, given that they are iterative and build models from the entire training dataset?
Hadoop can be used for iterative models, but it's not the most efficient approach. Take a look at Spark MLlib, Mahout, or Giraph for these types of iterative problems.
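For reference, a streaming job is launched with a command along these lines; the jar path, HDFS paths, and script names below are placeholders for your own setup, not exact values from the answer:

```shell
# Illustrative only -- adjust the jar location and paths for your cluster.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/you/bigfile.csv \
    -output /user/you/output \
    -mapper  "python mapper.py" \
    -reducer "python reducer.py" \
    -file mapper.py \
    -file reducer.py
```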
