I have a Python script that needs to process a large file. The code works fine if I reduce the original file and run the script, but when I run it on the original data it takes forever to execute. I am considering using HDFS to store the file and read it from the Python script. But in order to use HDFS, do I have to convert my Python script into a MapReduce program, or can I use the same code?
1 Answer
You'll likely need to tweak your Python code and then use Hadoop Streaming to process the file. This is exactly the type of situation for which Streaming was intended.
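To give a sense of the kind of tweak involved: a Hadoop Streaming mapper is just a script that reads records from stdin and writes tab-separated key/value pairs to stdout. Below is a minimal sketch of a word-count-style mapper (the job itself is illustrative, not the asker's actual workload); the `map_lines` helper name is my own, chosen so the logic can be exercised outside Hadoop:

```python
import sys

def map_lines(lines):
    # For each input line, emit one "key\tvalue" record per word --
    # the tab-separated format Hadoop Streaming expects on stdout.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

if __name__ == "__main__":
    # Under Hadoop Streaming, input arrives on stdin and results
    # must go to stdout; the framework handles the rest.
    for record in map_lines(sys.stdin):
        print(record)
```

The same pattern applies to a reducer: it reads the mapper's sorted key/value lines from stdin and aggregates them. The key point is that the script stays plain Python; only its I/O has to follow the stdin/stdout convention.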
5 Comments
Laszlowaty
Thank you! I was struggling with a similar problem (files with 100k+ lines).
sacrac
Could you please elaborate a bit more? Do I need to tweak the code to convert it into MapReduce?
Jakob Homan
Have you read through the example I provided? It's pretty much step by step for Python code, so adjust your script to follow its logic: read input from stdin and write output to stdout, as Hadoop Streaming expects.
sacrac
Thank you for the input. I am actually trying to run a scikit-learn extra trees classifier on a dataset with 160,057 rows and 100 columns. I am unable to run this on my local machine, hence trying Hadoop Streaming as you mentioned. But does Hadoop help in any way with machine learning problems, given that they are iterative and build models from the entire training dataset?
Jakob Homan
Hadoop can be used for iterative models, but it's not the most efficient approach. Take a look at Spark ML, Mahout, or Giraph for these types of iterative problems.