
I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike.

My problem comes when I have to load the CSV with pandas.read_csv. I would like to use the chunksize parameter to split it, but since I have to perform a groupby operation on user_id to make my calculation (I don't want my score to be too trivial), I don't know how to split the data smartly: if I just use chunksize, a given user's rows may be scattered across several chunks, so I won't be able to use groupby properly.

To keep it simple: I want to do a calculation for each user that depends on the timestamp and the category clicked. For instance, give a user 1 point for a click that happened a month ago, 2 points for one that happened two weeks ago, and 4 points for one in the last week.

How can I do this? Am I missing something?
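For concreteness, here is the kind of chunked accumulation I have in mind, since a score like this is additive per click. This is only a sketch (the file name, chunk size, and exact point thresholds are placeholders), and I am not sure it is the right approach:

    import pandas as pd

    def click_points(ts, now):
        # illustrative thresholds from the example above
        age = now - ts
        if age <= pd.Timedelta(days=7):
            return 4
        if age <= pd.Timedelta(days=14):
            return 2
        if age <= pd.Timedelta(days=30):
            return 1
        return 0

    now = pd.Timestamp.now()
    scores = pd.Series(dtype='float64')  # running per-user totals

    for chunk in pd.read_csv('web_logs.csv', chunksize=10**6,
                             parse_dates=['time_stamp']):
        chunk['points'] = chunk['time_stamp'].apply(click_points, now=now)
        partial = chunk.groupby('user_id')['points'].sum()
        # a sum of per-click points can be merged across chunks,
        # so no chunk ever needs all of a user's rows at once
        scores = scores.add(partial, fill_value=0)

This only works because the aggregation is a plain sum that can be merged chunk by chunk; my worry is about scores that are not decomposable this way.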

  • Possibly related: Pandas GroupBy Mean of Large DataSet in CSV. Commented Aug 19, 2014 at 15:45
  • I am sorry, but it is not really related: the solution given there is specific to the mean() function and won't work in my case. Commented Aug 19, 2014 at 15:49
  • 2
    You basically need to do this: stackoverflow.com/questions/15798209/…. In a nutshell, read in your data using read_csv, save to a hdfstore (table format). Then you can get the keys of the groupby (user_id), and aggregate as needed with a minimum of queries. This is quite scalable. Commented Aug 19, 2014 at 16:04
  • I would really appreciate it if someone could give more details for the case where there are more than a million groups. I think it could be useful to more than just me. Commented Aug 21, 2014 at 7:41
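A minimal sketch of the HDFStore approach described in the comments above (the file names, chunk size, and batch size are assumptions, and the final aggregation is a placeholder for the real scoring):

    import numpy as np
    import pandas as pd

    # one pass: convert the CSV into an on-disk, queryable table
    store = pd.HDFStore('logs.h5', mode='w')
    for chunk in pd.read_csv('web_logs.csv', chunksize=10**6,
                             parse_dates=['time_stamp']):
        # table format with data_columns makes user_id filterable on disk
        store.append('logs', chunk, data_columns=['user_id'], index=False)

    # build the on-disk index once, after all appends
    store.create_table_index('logs', columns=['user_id'], kind='full')

    # read only the user_id column to get the groupby keys
    user_ids = store.select_column('logs', 'user_id').unique()

    # with around a million groups, query users in batches
    # rather than issuing one query per user
    results = []
    for batch in np.array_split(user_ids, max(1, len(user_ids) // 10000)):
        batch = list(batch)
        sub = store.select('logs', where='user_id in batch')
        # placeholder aggregation; plug the real scoring in here
        results.append(sub.groupby('user_id').size())

    scores = pd.concat(results)
    store.close()

Batching the where-queries keeps the number of disk scans proportional to the number of batches rather than the number of users, which is what should make the million-group case tractable.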
