
I have a process that takes input data, processes it and outputs the data. Along the way it generates two logs, IN.log and OUT.log.

IN.log records when each piece of data came in, along with the id of the data. OUT.log records when the data was processed, along with the id of the data. So:

IN.log contains: in-time, id

OUT.log contains: out-time, id

Now, as part of processing with Hadoop Streaming in Python, I would like to join these two files and produce the difference between in-time and out-time, together with the id of the data.

For example:

2 seconds id123

3 seconds id112

Any pointers as to how this can be achieved using Python?

  • What have you tried so far? What do you need Hadoop, MapReduce and the other machinery for? Commented Nov 26, 2013 at 12:08
  • These files are going to be pretty big (a few GBs), so I chose the Hadoop route... I was able to achieve this using Hive, but also wanted to check whether Hadoop Streaming provides faster processing. Commented Nov 26, 2013 at 12:48

1 Answer


Take a look at the mrjob helper package for running Hadoop jobs. It would be pretty easy to write a map/reduce job for this task, something along the lines of the following code:

from datetime import datetime

from mrjob.job import MRJob
from mrjob.step import MRStep


class JoinJob(MRJob):
    # Adjust to match the timestamp format actually used in the logs.
    fmt = '%Y-%m-%d %H:%M:%S'

    def steps(self):
        return [MRStep(mapper=self.mapper,
                       reducer=self.reducer)]

    def mapper(self, _, line):
        # Each input line is assumed to look like "<timestamp> <id>".
        rec_time, rec_id = line.rsplit(' ', 1)
        yield rec_id, rec_time

    def reducer(self, rec_id, datetime_strs):
        # Both log entries for an id arrive here; the gap between the
        # earliest and latest timestamp is out-time minus in-time.
        datetimes = [datetime.strptime(s, self.fmt) for s in datetime_strs]
        delta_secs = (max(datetimes) - min(datetimes)).total_seconds()
        yield rec_id, delta_secs


if __name__ == '__main__':
    JoinJob.run()
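Since the question asked about Hadoop Streaming directly, here is what the same join could look like without mrjob: a single script (hypothetical name join_step.py) that acts as both mapper and reducer. The timestamp format and the "<timestamp> <id>" line layout are assumptions, since the question doesn't show the actual log lines, so treat this as a sketch rather than a drop-in solution.

```python
import sys
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'  # assumed timestamp format; adjust to the real logs


def mapper(lines):
    # Emit "id<TAB>timestamp" so the shuffle phase groups the IN.log
    # and OUT.log records for the same id onto one reducer.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec_time, rec_id = line.rsplit(' ', 1)
        yield '%s\t%s' % (rec_id, rec_time)


def reducer(lines):
    # Collect both timestamps per id and emit the gap in whole seconds.
    times = {}
    for line in lines:
        rec_id, rec_time = line.strip().split('\t')
        times.setdefault(rec_id, []).append(datetime.strptime(rec_time, FMT))
    for rec_id, stamps in times.items():
        delta = (max(stamps) - min(stamps)).total_seconds()
        yield '%s\t%.0f' % (rec_id, delta)


if __name__ == '__main__' and len(sys.argv) > 1:
    # Run as "join_step.py map" or "join_step.py reduce" under Hadoop
    # Streaming; both stages read stdin and write stdout.
    step = mapper if sys.argv[1] == 'map' else reducer
    for out in step(sys.stdin):
        print(out)
```

You would wire it up with something along the lines of hadoop jar hadoop-streaming.jar -input IN.log -input OUT.log -output out -mapper 'join_step.py map' -reducer 'join_step.py reduce' (exact jar path and options depend on your installation).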