I have a Python script that needs to process a large file. The code works fine if I reduce the original file and run the script, but when I run it on the original data it takes forever to execute. I am considering storing the file in HDFS and reading it from the Python script. But in order to use HDFS, do I have to convert my Python script into a MapReduce program, or can I use the same code?

2 Comments
  • how big is your file? Commented Jul 20, 2015 at 19:11
  • It has 160,057 rows and 100 columns. Commented Jul 20, 2015 at 19:14

1 Answer

You'll likely need to tweak your Python code and then use Hadoop Streaming to process it. This is exactly the type of situation for which streaming was intended.
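As a minimal sketch of the contract Hadoop Streaming expects: the mapper reads raw records and emits tab-separated key/value pairs, and the reducer receives those pairs grouped by key. The word counting below is just a placeholder for whatever per-row work your script actually does; in a real job the mapper would consume `sys.stdin` and print to `sys.stdout`.

```python
from itertools import groupby

def mapper(lines):
    # Map step: emit a (key, value) pair for each word on each input line.
    # In a streaming job these lines come from sys.stdin, and each pair is
    # written to sys.stdout as "key\tvalue".
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Reduce step: pairs arrive grouped by key, because Hadoop sorts the
    # mapper output between the two phases.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

# Local dry run -- we sort manually to stand in for Hadoop's shuffle/sort:
sample = ["to be or not to be"]
counts = dict(reducer(sorted(mapper(sample))))
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

To run this on a cluster you would split the mapper and reducer into two small scripts (or dispatch on a command-line argument) and point Hadoop Streaming at them.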


5 Comments

Thank you! I was struggling with a similar problem (for files with 100k+ lines).
Could you please elaborate a bit more? Do I need to tweak the code to convert it into MapReduce?
Have you read through the example I provided? It's pretty much step by step for Python code, so adjust your script to follow its logic regarding reading input from stdin and writing to stdout, as expected by Hadoop Streaming.
Thank you for the input. I am actually trying to run a scikit-learn extra-trees classifier on a dataset with 160,057 rows and 100 columns. I am unable to run this on my local machine and hence am trying to use Hadoop Streaming, as you mentioned. But does Hadoop help in any way with machine learning problems, given that they are iterative and build models from the entire training dataset?
Hadoop can be used for iterative models, but it's not the most efficient approach. Take a look at Spark MLlib, Mahout, or Giraph for these types of iterative problems.
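For reference, a streaming job is launched with a command along these lines; the jar path, HDFS paths, and script names below are placeholders for your own setup, not exact values from the answer:

```shell
# Illustrative only -- adjust the jar location and paths for your cluster.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/you/bigfile.csv \
    -output /user/you/output \
    -mapper  "python mapper.py" \
    -reducer "python reducer.py" \
    -file mapper.py \
    -file reducer.py
```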
