
I have a dataframe that looks like -

   block  lat  lon
0      0  112   50
1      0  112   50
2      0  112   50
3      1  105   20
4      1  105   20
5      2  130   30

and I want to first group by block and then apply a function to the lat/lon columns, e.g.

df['location_id'] = df.groupby('block').apply(lambda x: get_location_id(x['lat'], x['lon']))

For each lat/lon my function returns an ID, and I want a new column that holds that ID. When I try the above it doesn't work, and the lambda doesn't seem to accept axis=1 or similar. The desired output is:

   block  lat  lon location_id
0      0  112   50  1
1      0  112   50  1
2      0  112   50  1
3      1  105   20  23
4      1  105   20  23
5      2  130   30  15

I'd like to avoid just applying the function to the ungrouped dataframe because my dataset is quite large and that will be slow.

Edit: the function takes a lat and a lon and returns a single string ID.
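For illustration, here is a hypothetical stand-in for get_location_id (not my real function, just something with the same signature) and the straightforward row-wise apply on the whole frame that I'd like to avoid:

import pandas as pd

df = pd.DataFrame({
    'block': [0, 0, 0, 1, 1, 2],
    'lat':   [112, 112, 112, 105, 105, 130],
    'lon':   [50, 50, 50, 20, 20, 30],
})

# Hypothetical stand-in: takes a lat and a lon, returns a single string ID.
def get_location_id(lat, lon):
    return f"id_{lat}_{lon}"

# The straightforward row-wise version (works, but slow on a large frame).
df['location_id'] = df.apply(lambda row: get_location_id(row['lat'], row['lon']), axis=1)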

  • please show us the function and how the desired output column is calculated. Commented Nov 17, 2022 at 7:57
  • As said, please add the function get_location_id to your question. The edit didn't help very much. Commented Nov 17, 2022 at 8:48

1 Answer


Depending on how "large" your "large" dataset is, there might be different solutions, and I'm not 100% certain you can (or should) do what you want with groupby. I'd suggest the following (and I'm relatively sure it will work even in distributed environments); a sketch follows the list:

  1. Create a new dataframe with non-duplicated block,lat,lon combinations.
  2. Apply your function to that dataframe.
  3. Inner join that dataframe back to your original dataframe on those three columns. (While this feels wasteful, inner joins are usually done via hash joins and are quite fast even on Spark clusters.)
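A minimal sketch of those three steps, assuming a get_location_id(lat, lon) that returns a string as described in the question, and starting from the original frame without a location_id column:

# Step 1: unique block/lat/lon combinations only.
unique_locs = df[['block', 'lat', 'lon']].drop_duplicates()

# Step 2: call the (potentially expensive) function once per unique combination.
unique_locs['location_id'] = unique_locs.apply(
    lambda row: get_location_id(row['lat'], row['lon']), axis=1
)

# Step 3: inner join the IDs back onto the original dataframe.
df = df.merge(unique_locs, on=['block', 'lat', 'lon'], how='inner')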

Edit: Pandas is quite fine with handling datasets that take gigabytes in RAM as long as you have the RAM, so just applying the function directly might be a lot more viable than you think.


2 Comments

Thanks for the advice. It's useful to know joins use hashing, because I'm trying to maintain speed. I'm effectively trying to add a value to a new column for every unique block, and it just seemed like something I should've been able to do in one step. Would using map between the block id column of each DF be faster than or equal to an inner join?
Joins in pandas are simply in memory and don't need any fancy hashing... but if you ever have a dataset that needs distributed computing with Spark, then inner joins usually are hash joins, if my memory doesn't fail me. Or at least they are quite fast even in a distributed setting. Edit: it helps in your case even more that one of the datasets will most likely be significantly smaller than the other.
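Regarding the map idea: assuming each block maps to a single lat/lon pair (as in the example data), a rough sketch of that approach would be to compute one ID per block and map it onto the block column:

# One row per block, indexed by block.
per_block = df.drop_duplicates('block').set_index('block')

# Series mapping block -> location_id (function called once per block).
block_to_id = per_block.apply(
    lambda row: get_location_id(row['lat'], row['lon']), axis=1
)

# Broadcast the IDs back to every row via the block column.
df['location_id'] = df['block'].map(block_to_id)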
