
I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike.

My problem comes when I have to load the CSV with pandas.read_csv. I would like to use the chunksize parameter to split it, but since I have to perform a groupby operation on user_id to make my calculation (I don't want my score to be too trivial), I don't know how to split the data smartly: if I just use chunksize, a given user's rows may be scattered across several chunks, so I won't be able to use groupby properly.

To keep it simple: I want to do a calculation for each user that depends on the timestamp and the category clicked. For instance, give a user 1 point for a click that happened a month ago, 2 points for one that happened two weeks ago, and 4 points for one in the last week.

How can I do this? Am I missing something?
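For concreteness, here is the kind of chunked accumulation I have in mind, since a score like this is additive per click. This is only a sketch (the file name, chunk size, and exact point thresholds are placeholders), and I am not sure it is the right approach:

    import pandas as pd

    def click_points(ts, now):
        # illustrative thresholds from the example above
        age = now - ts
        if age <= pd.Timedelta(days=7):
            return 4
        if age <= pd.Timedelta(days=14):
            return 2
        if age <= pd.Timedelta(days=30):
            return 1
        return 0

    now = pd.Timestamp.now()
    scores = pd.Series(dtype='float64')  # running per-user totals

    for chunk in pd.read_csv('web_logs.csv', chunksize=10**6,
                             parse_dates=['time_stamp']):
        chunk['points'] = chunk['time_stamp'].apply(click_points, now=now)
        partial = chunk.groupby('user_id')['points'].sum()
        # a sum of per-click points can be merged across chunks,
        # so no chunk ever needs all of a user's rows at once
        scores = scores.add(partial, fill_value=0)

This only works because the aggregation is a plain sum that can be merged chunk by chunk; my worry is about scores that are not decomposable this way.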

  • Possibly related: Pandas GroupBy Mean of Large DataSet in CSV. Commented Aug 19, 2014 at 15:45
  • I am sorry, but it is not really related: the solution given there is specific to the mean() function and won't work in my case. Commented Aug 19, 2014 at 15:49
  • 2
    You basically need to do this: stackoverflow.com/questions/15798209/…. In a nutshell, read in your data using read_csv, save to a hdfstore (table format). Then you can get the keys of the groupby (user_id), and aggregate as needed with a minimum of queries. This is quite scalable. Commented Aug 19, 2014 at 16:04
  • I would really appreciate it if someone could give more details for the case where there are more than a million groups. I think it could be useful to more than just me. Commented Aug 21, 2014 at 7:41
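A minimal sketch of the HDFStore approach described in the comments above (the file names, chunk size, and batch size are assumptions, and the final aggregation is a placeholder for the real scoring):

    import numpy as np
    import pandas as pd

    # one pass: convert the CSV into an on-disk, queryable table
    store = pd.HDFStore('logs.h5', mode='w')
    for chunk in pd.read_csv('web_logs.csv', chunksize=10**6,
                             parse_dates=['time_stamp']):
        # table format with data_columns makes user_id filterable on disk
        store.append('logs', chunk, data_columns=['user_id'], index=False)

    # build the on-disk index once, after all appends
    store.create_table_index('logs', columns=['user_id'], kind='full')

    # read only the user_id column to get the groupby keys
    user_ids = store.select_column('logs', 'user_id').unique()

    # with around a million groups, query users in batches
    # rather than issuing one query per user
    results = []
    for batch in np.array_split(user_ids, max(1, len(user_ids) // 10000)):
        batch = list(batch)
        sub = store.select('logs', where='user_id in batch')
        # placeholder aggregation; plug the real scoring in here
        results.append(sub.groupby('user_id').size())

    scores = pd.concat(results)
    store.close()

Batching the where-queries keeps the number of disk scans proportional to the number of batches rather than the number of users, which is what should make the million-group case tractable.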
