
I have a scenario where I need to transform the values of a particular column based on the value in another column of the same row, and on a value in another dataframe.

Example-

print(parent_df)
       school         location      modifed_date
0      school_1       New Delhi     2020-04-06
1      school_2       Kolkata       2020-04-06
2      school_3       Bengaluru     2020-04-06
3      school_4       Mumbai        2020-04-06
4      school_5       Chennai       2020-04-06

print(location_df)
       school          location     
0      school_10       New Delhi
1      school_20       Kolkata     
2      school_30       Bengaluru
3      school_40       Mumbai       
4      school_50       Chennai
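
For anyone who wants to experiment, the two sample frames can be reconstructed like this (a sketch; values copied from the printouts above, including the `modifed_date` spelling):

```python
import pandas as pd

# Sample data copied from the printouts above (original scenario)
parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru', 'Mumbai', 'Chennai'],
    'modifed_date': ['2020-04-06'] * 5,
})

location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30', 'school_40', 'school_50'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru', 'Mumbai', 'Chennai'],
})
```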

As per this use case, I need to transform the school names present in parent_df, based on the location column in the same dataframe and the location column present in location_df.

To achieve this transformation, I wrote the following method.

def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']

And this is how I am calling this method

parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)

The issue is that for just 46K records, the entire transformation takes around 2 minutes, which is too slow. How can I improve the performance of this solution?

EDITED

Following is the actual scenario I am dealing with, wherein a small transformation needs to be done before we can replace the value in the original column. I am not sure if this can be done within the replace() method mentioned in one of the answers below.

print(parent_df)
       school         location                  modifed_date    type
0      school_1       _pre_New Delhi_post       2020-04-06      Govt
1      school_2       _pre_Kolkata_post         2020-04-06      Private
2      school_3       _pre_Bengaluru_post       2020-04-06      Private
3      school_4       _pre_Mumbai_post          2020-04-06      Govt
4      school_5       _pre_Chennai_post         2020-04-06      Private

print(location_df)
       school          location     type
0      school_10       New Delhi    Govt
1      school_20       Kolkata      Private
2      school_30       Bengaluru    Private
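
The edited sample frames can likewise be reconstructed for experimentation (a sketch; values copied from the printouts above):

```python
import pandas as pd

# Edited scenario: location is decorated with _pre_/_post markers,
# and both frames carry a 'type' column that must also match
parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['_pre_New Delhi_post', '_pre_Kolkata_post',
                 '_pre_Bengaluru_post', '_pre_Mumbai_post', '_pre_Chennai_post'],
    'modifed_date': ['2020-04-06'] * 5,
    'type': ['Govt', 'Private', 'Private', 'Govt', 'Private'],
})

location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
    'type': ['Govt', 'Private', 'Private'],
})
```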

Custom Method code

def transform_school_name(row, location_df):
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'] == location_values[1]]
    name_alias = name_alias[name_alias['type'] == location_df['type']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']


This is the actual scenario I need to handle, so using the replace() method alone won't help.


4 Answers


You can use map/replace:

parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])

Output:

      school   location modifed_date
0  school_10  New Delhi   2020-04-06
1  school_20    Kolkata   2020-04-06
2  school_30  Bengaluru   2020-04-06
3  school_40     Mumbai   2020-04-06
4  school_50    Chennai   2020-04-06
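
A minimal, self-contained sketch of this approach (toy data; 'Pune' stands in for a location missing from location_df), illustrating the point raised in the comments that replace leaves unmatched values intact while map yields NaN and needs a fillna:

```python
import pandas as pd

# Toy frames; 'Pune' does NOT exist in location_df
parent_df = pd.DataFrame({
    'school': ['school_1', 'school_6'],
    'location': ['New Delhi', 'Pune'],
})
location_df = pd.DataFrame({
    'school': ['school_10'],
    'location': ['New Delhi'],
})

lookup = location_df.set_index('location')['school']

# replace: unmatched locations are left as-is ('Pune' stays 'Pune')
replaced = parent_df.location.replace(lookup)

# map: unmatched locations become NaN, so a fillna is needed afterwards
mapped = parent_df.location.map(lookup).fillna(parent_df.location)

print(replaced.tolist())  # ['school_10', 'Pune']
print(mapped.tolist())    # ['school_10', 'Pune']
```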

Comments

What about the cases where the values don't exist? In that case, I need to keep the original value intact.
@jezrael it updates only if the location exists in location_df, hence replace instead of map.
I think there are duplicates; it seems only one school gets selected.
@MitakshGupta that's what replace does. Only update the existing values, leave the non-existent intact. map needs an extra fillna followed.
Reopened, since the edited data doesn't match exactly.

IIUC, this is more of a regex issue, as the pattern doesn't match exactly. First extract the required pattern, create a mapping from the location in parent_df to location_df, then map the values.

pat = '.*?' + '(' + '|'.join(location_df['location']) + ')' + '.*?'

mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])

parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df


    school      location            modifed_date    type
0   school_10   _pre_New Delhi_post 2020-04-06      Govt
1   school_20   _pre_Kolkata_post   2020-04-06      Private
2   school_30   _pre_Bengaluru_post 2020-04-06      Private
3   school_4    _pre_Mumbai_post    2020-04-06      Govt
4   school_5    _pre_Chennai_post   2020-04-06      Private
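
To see why combine_first is needed, here is a small sketch of what the extraction step returns when a location (Mumbai here) is absent from the alternation pattern:

```python
import pandas as pd

cities = ['New Delhi', 'Kolkata', 'Bengaluru']  # locations known to location_df
pat = '.*?(' + '|'.join(cities) + ').*?'

loc = pd.Series(['_pre_New Delhi_post', '_pre_Mumbai_post'])
extracted = loc.str.extract(pat)[0]
# The first row matches and yields 'New Delhi'; 'Mumbai' is not in the
# alternation, so extract returns NaN there, and combine_first then
# falls back to the original school value for that row.
```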



As I understood the edited task, the following update is to be performed:

  • for each row in parent_df,
  • find a row in location_df with a matching location (the city part of the location column) and matching type,
  • if found, overwrite school column in parent_df with school from the row just found.

To do it, proceed as follows:

Step 1: Generate a MultiIndex to locate school names by city and school type:

ind = pd.MultiIndex.from_arrays([parent_df.location.str
    .split('_', expand=True)[2], parent_df.type])

For your sample data, the result is:

MultiIndex([('New Delhi',    'Govt'),
            (  'Kolkata', 'Private'),
            ('Bengaluru', 'Private'),
            (   'Mumbai',    'Govt'),
            (  'Chennai', 'Private')],
           names=[2, 'type'])

Don't worry about the strange first-level name (2); it will disappear soon.

Step 2: Generate a list of "new" locations:

locList = location_df.set_index(['location', 'type']).school[ind].tolist()

The result is:

['school_10', 'school_20', 'school_30', nan, nan]

For the first 3 schools a match was found; for the last 2, nothing.

Step 3: Perform the actual update with the above list, through "non-null" mask:

parent_df.school = parent_df.school.mask(pd.notnull(locList), locList)
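
Putting the three steps together as a runnable sketch (sample data from the edited question; note that on recent pandas versions indexing a Series with labels that are not all present raises a KeyError, so reindex is used here instead of a plain [] lookup):

```python
import pandas as pd

parent_df = pd.DataFrame({
    'school': ['school_1', 'school_2', 'school_3', 'school_4', 'school_5'],
    'location': ['_pre_New Delhi_post', '_pre_Kolkata_post',
                 '_pre_Bengaluru_post', '_pre_Mumbai_post', '_pre_Chennai_post'],
    'type': ['Govt', 'Private', 'Private', 'Govt', 'Private'],
})
location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
    'type': ['Govt', 'Private', 'Private'],
})

# Step 1: MultiIndex of (city, type) pairs extracted from parent_df
ind = pd.MultiIndex.from_arrays(
    [parent_df.location.str.split('_', expand=True)[2], parent_df.type])

# Step 2: look up replacement school names; unmatched pairs become NaN
locList = (location_df.set_index(['location', 'type'])
           .school.reindex(ind).tolist())

# Step 3: overwrite only where a replacement was found
parent_df.school = parent_df.school.mask(pd.notnull(locList), locList)

print(parent_df.school.tolist())
# ['school_10', 'school_20', 'school_30', 'school_4', 'school_5']
```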

Execution speed

Due to the use of vectorized operations and index-based lookup, my code runs substantially faster than applying a function to each row.

Example: I replicated your parent_df 10,000 times and checked with %timeit the execution time of your code (actually, a slightly changed version, described below) and mine.

To allow repeated execution, I changed both versions so that they set a school_2 column, and school remains unchanged.

Your code ran in 34.9 s, whereas mine took only 161 ms, about 216 times faster.

Yet quicker version

If parent_df has a default index (consecutive numbers starting from 0), then the whole operation can be performed with a single instruction:

parent_df.school = location_df.set_index(['location', 'type']).school[
    pd.MultiIndex.from_arrays(
        [parent_df.location.str.split('_', expand=True)[2],
         parent_df.type])
    ]\
    .reset_index(drop=True)\
    .combine_first(parent_df.school)

Steps:

  • location_df.set_index(...) - Set index to 2 "criterion" columns.
  • .school - Leave only school column (with the above index).
  • [...] - Retrieve from it elements indicated by the MultiIndex defined inside.
  • pd.MultiIndex.from_arrays( - Create the MultiIndex.
  • parent_df.location.str.split('_', expand=True)[2] - The first level of the MultiIndex - the "city" part from location.
  • parent_df.type - The second level of the MultiIndex - type.
  • reset_index(...) - Change the MultiIndex into a default index (now the index is the same as in parent_df).
  • combine_first(...) - Overwrite NaN values in the result generated so far with original values from school.
  • parent_df.school = - Save the result back in the school column. To check the execution speed, you can change it to parent_df['school_2'].

According to my assessment, the execution time is about 9% shorter than for my original solution.

Corrections to your code

  1. Take a look at location_values[1]. It retrieves the pre segment, whereas the next segment (the city name, index 2) should be retrieved.

  2. There is no need to create a temporary DataFrame based on the first condition and then narrow it down with the second. Both conditions (equality of location and type) can be combined in a single instruction, so execution is a bit faster.

  3. The value returned in the "positive" case should be from name_alias, not location_df.

So if for some reason you want to stick with your code, change the respective fragment to:

name_alias = location_df[location_df['location'].eq(location_values[2]) &
    location_df['type'].eq(row.type)]
if len(name_alias) > 0:
    return name_alias.school.iloc[0]
else:
    return row['school']
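
For completeness, here is the corrected fragment plugged back into the row-wise function, together with the apply call from the question (a sketch on a two-row sample):

```python
import pandas as pd

location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
    'type': ['Govt', 'Private', 'Private'],
})
parent_df = pd.DataFrame({
    'school': ['school_1', 'school_4'],
    'location': ['_pre_New Delhi_post', '_pre_Mumbai_post'],
    'type': ['Govt', 'Govt'],
})

def transform_school_name(row, location_df):
    location_values = row['location'].split('_')
    # Filter on the city segment (index 2, not 1) and on type in one step
    name_alias = location_df[location_df['location'].eq(location_values[2]) &
                             location_df['type'].eq(row.type)]
    if len(name_alias) > 0:
        return name_alias.school.iloc[0]  # take school from the filtered frame
    return row['school']

parent_df['school'] = parent_df.apply(transform_school_name,
                                      args=(location_df,), axis=1)
print(parent_df['school'].tolist())  # ['school_10', 'school_4']
```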



If I'm reading the question correctly, what you're implementing with the apply method is a kind of join operation. Pandas excels at vectorized operations, and its C-based implementation of join (merge) is almost certainly faster than a Python/apply-based one. Therefore, I would try the following solution:

parent_df["location_short"] = parent_df.location.str.split("_", expand=True)[2]
parent_df = pd.merge(parent_df, location_df, how = "left", left_on=["location_short", "type"], 
                     right_on=["location", "type"], suffixes = ["", "_by_location"])

parent_df.loc[parent_df.school_by_location.notna(), "school"] = \
      parent_df.loc[parent_df.school_by_location.notna(), "school_by_location"]

As far as I could understand, it produces what you're looking for.
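
A self-contained sketch of this merge approach (toy data from the edited question), selecting the original columns at the end so the helper columns produced by the join don't linger:

```python
import pandas as pd

parent_df = pd.DataFrame({
    'school': ['school_1', 'school_4'],
    'location': ['_pre_New Delhi_post', '_pre_Mumbai_post'],
    'type': ['Govt', 'Govt'],
})
location_df = pd.DataFrame({
    'school': ['school_10', 'school_20', 'school_30'],
    'location': ['New Delhi', 'Kolkata', 'Bengaluru'],
    'type': ['Govt', 'Private', 'Private'],
})

# Extract the city part of the decorated location string
parent_df['location_short'] = parent_df.location.str.split('_', expand=True)[2]

merged = pd.merge(parent_df, location_df, how='left',
                  left_on=['location_short', 'type'],
                  right_on=['location', 'type'],
                  suffixes=['', '_by_location'])

# Overwrite school only where the left join found a match
mask = merged.school_by_location.notna()
merged.loc[mask, 'school'] = merged.loc[mask, 'school_by_location']

# Keep only the original columns
result = merged[['school', 'location', 'type']]
print(result['school'].tolist())  # ['school_10', 'school_4']
```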

