
I have a 20 GB transaction data set from Kaggle (http://www.kaggle.com/c/acquire-valued-shoppers-challenge/data).

It has over 300 million rows and 11 variables.

It is too heavy to handle in R, so I want to filter the data.

[screenshot]

The class of the id column is integer64.

There are 311,541 unique ids, and I want to sample 20,000 of them.

I'm using data.table, but I get an error like the one in the picture.

Is there a way to sample ids?

  • Maybe you can try converting the id to character class. Also check this link: stackoverflow.com/questions/15614846/… Commented Nov 19, 2014 at 6:14
  • Thanks! Converting to character works well. Commented Nov 19, 2014 at 8:26
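
The character-conversion approach from the comments might look roughly like the sketch below. It is only an illustration: the toy table, the purchaseamount column, and the sample size of 10 are stand-ins, not the real Kaggle data, and it assumes the bit64 package is installed alongside data.table.

```r
library(data.table)
library(bit64)

# Toy stand-in for the real table: 50 ids with 4 rows each (hypothetical data)
transaction <- data.table(
  id = as.integer64(rep(86246L + 1:50, each = 4)),
  purchaseamount = rep(c(1.5, 2.0, 3.25, 4.0), times = 50)
)

set.seed(1)

# Sample from a character copy of the unique ids instead of the integer64 values
unique_ids  <- unique(as.character(transaction$id))
sampled_ids <- sample(unique_ids, 10)

# Keep every row whose id is among the sampled ones
sample_transac <- transaction[as.character(id) %in% sampled_ids]
```

On the full 300-million-row table, as.character(id) creates a large temporary character vector, so this may still run into memory limits.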

1 Answer

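If I recall correctly, integer64 values (from the bit64 package) are stored as doubles whose bits are reinterpreted as 64-bit integers. Maybe the best way to obtain your subset without copying the column is to use the setattr function in data.table, which changes attributes by reference. Try this: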

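library(data.table)
# remove the integer64 class by reference (no copy of the column)
setattr(transaction$id, "class", NULL)
# draw 20000 of the unique ids and keep every row that belongs to them
custom_sample <- sample(unique(transaction$id), 20000)
sample_transac <- transaction[id %in% custom_sample, ]
# give the integer64 class back, on the sample and on the original table
setattr(sample_transac$id, "class", "integer64")
setattr(transaction$id, "class", "integer64")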

1 Comment

It may work, but my computer can't handle it: I get an allocation size error. Thank you for answering.
