
I have a dataframe that looks like -

   block  lat  lon
0      0  112   50
1      0  112   50
2      0  112   50
3      1  105   20
4      1  105   20
5      2  130   30

and I want to first group by block and then apply a function to the lat/lon columns, e.g.

df['location_id'] = df.groupby('block').apply(lambda x: get_location_id(x['lat'], x['lon']))

For each lat/lon my function returns an ID, and I want a new column that holds that ID. When I try the above it doesn't work, and the lambda doesn't seem to accept axis=1 or similar. The desired output is:

   block  lat  lon location_id
0      0  112   50  1
1      0  112   50  1
2      0  112   50  1
3      1  105   20  23
4      1  105   20  23
5      2  130   30  15

I'd like to avoid just applying the function to the ungrouped dataframe because my dataset is quite large and that will be slow.

Edit: the function takes a lat and a lon and returns a single string ID.
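For illustration, here is a hypothetical stand-in for get_location_id (not my real function, just something with the same signature) and the straightforward row-wise apply on the whole frame that I'd like to avoid:

import pandas as pd

df = pd.DataFrame({
    'block': [0, 0, 0, 1, 1, 2],
    'lat':   [112, 112, 112, 105, 105, 130],
    'lon':   [50, 50, 50, 20, 20, 30],
})

# Hypothetical stand-in: takes a lat and a lon, returns a single string ID.
def get_location_id(lat, lon):
    return f"id_{lat}_{lon}"

# The straightforward row-wise version (works, but slow on a large frame).
df['location_id'] = df.apply(lambda row: get_location_id(row['lat'], row['lon']), axis=1)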

  • please show us the function and how the desired output column is calculated. Commented Nov 17, 2022 at 7:57
  • As said, please add the function get_location_id to your question. The edit didn't help very much. Commented Nov 17, 2022 at 8:48

1 Answer


Depending on how "large" your "large" dataset is, there might be different solutions, and I'm not 100% certain you can (or should) do what you want with groupby. I'd suggest the following (and I'm relatively sure it will work even in distributed environments); a sketch follows the list:

  1. Create a new dataframe with non-duplicated block,lat,lon combinations.
  2. Apply your function to that dataframe.
  3. Inner join that dataframe back to your original dataframe on those three columns. (While this feels wasteful, inner joins are usually done via hash joins and are quite fast even on Spark clusters.)
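A minimal sketch of those three steps, assuming a get_location_id(lat, lon) that returns a string as described in the question, and starting from the original frame without a location_id column:

# Step 1: unique block/lat/lon combinations only.
unique_locs = df[['block', 'lat', 'lon']].drop_duplicates()

# Step 2: call the (potentially expensive) function once per unique combination.
unique_locs['location_id'] = unique_locs.apply(
    lambda row: get_location_id(row['lat'], row['lon']), axis=1
)

# Step 3: inner join the IDs back onto the original dataframe.
df = df.merge(unique_locs, on=['block', 'lat', 'lon'], how='inner')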

Edit: Pandas is quite fine with handling datasets that take gigabytes in RAM as long as you have the RAM, so just applying the function directly might be a lot more viable than you think.


2 Comments

Thanks for the advice. It's useful to know joins use hashing, because I'm trying to maintain speed. I'm effectively trying to add a value to a new column for every unique block, and it just seemed like something I should've been able to do in one step. Would using map between the block id column of each DF be faster than or equal to an inner join?
Joins in pandas are simply in memory and don't need any fancy hashing... but if you ever have a dataset that needs distributed computing with Spark, then inner joins usually are hash joins, if my memory doesn't fail me. Or at least they are quite fast even in a distributed setting. Edit: it helps in your case even more that one of the datasets will most likely be significantly smaller than the other.
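Regarding the map idea: assuming each block maps to a single lat/lon pair (as in the example data), a rough sketch of that approach would be to compute one ID per block and map it onto the block column:

# One row per block, indexed by block.
per_block = df.drop_duplicates('block').set_index('block')

# Series mapping block -> location_id (function called once per block).
block_to_id = per_block.apply(
    lambda row: get_location_id(row['lat'], row['lon']), axis=1
)

# Broadcast the IDs back to every row via the block column.
df['location_id'] = df['block'].map(block_to_id)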
