
I have a data frame that loads a central file, which is updated with new files every month. Because a few columns are missing in the file that is copied into the data frame, I created a mapping data frame that fills in values for those missing columns when a condition is met.

Below is the central file example:

ID  Region   Country  Code  User  Order Price
1            Germany        ABC   2342545
2            Italy          DEF   5464545
3            USA            GHI   3245325
4            India          JKL   676565
5            Mexico         MNO   3443252
6            China          PQR   565445
7            Germany        STU   765765
8            Mexico         VWX   564566
9            China          YZA   346534
10           India          BCD   5675765

This is my mapping file for the missing Region and Code values:

Country  Region   Code
Germany  EU       1
Italy    EU       2
USA      America  3
India    Asia     4
Mexico   America  5
China    Asia     6

Here is the expected output:

ID  Region   Country  Code  User  Order Price
1   EU       Germany  1     ABC   2342545
2   EU       Italy    2     DEF   5464545
3   America  USA      3     GHI   3245325
4   Asia     India    4     JKL   676565
5   America  Mexico   5     MNO   3443252
6   Asia     China    6     PQR   565445
7   EU       Germany  1     STU   765765
8   America  Mexico   5     VWX   564566
9   Asia     China    6     YZA   346534
10  Asia     India    4     BCD   5675765

What I have done is use for loops with iterrows() to update the values in the data frame.

The problem is that it is extremely slow: it takes about an hour or more to update roughly 60,000 records.

Here is my code:

for central_update_index, central_update_row in central_bridge_file.iterrows():
    print('index: ', central_update_index)
    # Scan the whole mapping file for a row with the same country and
    # copy its Code and Region into the central file.
    for bridge_match_index, bridge_match_row in central_bridge_matching_file.iterrows():
        if bridge_match_row['Country'] == central_update_row['Country']:
            central_bridge_file.loc[central_update_index, 'Code'] = bridge_match_row['Code']
            central_bridge_file.loc[central_update_index, 'Region'] = bridge_match_row['Region']

Can someone help me write a lambda statement or something similar that could do this in minutes?

  • You could try iterating with itertuples() instead; it's a very small change code-wise but usually results in quite a good speed-up (see the sketch after these comments). Commented Jan 15, 2020 at 14:13
  • @Nathan: OK, I will try that; I am actually using it in another loop as well. What would be more useful is a one-liner statement: I have seen and used one within the same data frame, but I could not create one across two data frames. Commented Jan 15, 2020 at 14:20
  • Please include your sample data as text, and also include your expected output. Commented Jan 15, 2020 at 14:24
  • Are the empty fields empty strings or NaNs? Commented Jan 15, 2020 at 14:34
  • Updated. Please check. Commented Jan 15, 2020 at 15:02
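
A minimal sketch of the itertuples() change suggested in the first comment, assuming the simplified column names from the example data (Country, Region, Code). It keeps the same nested-loop logic and only swaps the iterators, so it is a small speed-up rather than the vectorized approach asked for:

for central_row in central_bridge_file.itertuples():
    for bridge_row in central_bridge_matching_file.itertuples():
        if bridge_row.Country == central_row.Country:
            # itertuples() yields namedtuples, so columns are attributes
            central_bridge_file.loc[central_row.Index, 'Code'] = bridge_row.Code
            central_bridge_file.loc[central_row.Index, 'Region'] = bridge_row.Region
            break  # first matching country is enough; stop scanning the mapping file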

2 Answers


Given df,

   ID  Region  Country  Code User  Order Price
0   1     NaN  Germany   NaN  ABC      2342545
1   2     NaN    Italy   NaN  DEF      5464545
2   3     NaN      USA   NaN  GHI      3245325
3   4     NaN    India   NaN  JKL       676565
4   5     NaN   Mexico   NaN  MNO      3443252
5   6     NaN    China   NaN  PQR       565445
6   7     NaN  Germany   NaN  STU       765765
7   8     NaN   Mexico   NaN  VWX       564566
8   9     NaN    China   NaN  YZA       346534
9  10     NaN    India   NaN  BCD      5675765

and df_map,

   Country   Region  Code
0  Germany       EU     1
1    Italy       EU     2
2      USA  America     3
3    India     Asia     4
4   Mexico  America     5
5    China     Asia     6

You can merge these two dataframes on 'Country':

df[['ID','Country','User','Order Price']].merge(df_map)

Output:

   ID  Country User  Order Price   Region  Code
0   1  Germany  ABC      2342545       EU     1
1   7  Germany  STU       765765       EU     1
2   2    Italy  DEF      5464545       EU     2
3   3      USA  GHI      3245325  America     3
4   4    India  JKL       676565     Asia     4
5  10    India  BCD      5675765     Asia     4
6   5   Mexico  MNO      3443252  America     5
7   8   Mexico  VWX       564566  America     5
8   6    China  PQR       565445     Asia     6
9   9    China  YZA       346534     Asia     6

3 Comments

No, it doesn't work; it didn't update anything. Do you think I should mention only Country, or all the columns other than the ones to be updated?
Well, that command isn't in place; you need to reassign back to df. Try this: df = df[['ID','Country','User','Order Price']].merge(df_map). Does the output look correct in this answer?
Good answer, works really well. The only thing is that I need to mention all the column names I want to keep for the final dataframe (one workaround is sketched below).
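
Following up on these comments, a minimal sketch of one way to avoid listing every column by hand, assuming the df and df_map shown in this answer: drop the empty columns, merge, then restore the original column order.

cols = df.columns                                   # remember the original column order
df = (df.drop(columns=['Region', 'Code'])           # drop the empty columns
        .merge(df_map, on='Country', how='left')    # pull Region and Code from the mapping
        .reindex(columns=cols))                     # back to ID, Region, Country, Code, ...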

If you want to completely replace the Region and Code columns in your data, you can do a merge:

df = (df.drop(['Region','Code'], axis=1)
        .merge(mapping, on='Country', how='left')
     )

If you only want to update those columns, i.e. keep any values that are already filled in, then:

mapping = mapping.set_index('Country')   # index by Country so .map() can look values up

for c in ['Region', 'Code']:
    # only fill the missing values; existing ones are kept
    df[c] = df[c].fillna(df['Country'].map(mapping[c]))

Output:

   ID   Region  Country  Code User  Order Price
0   1       EU  Germany   1.0  ABC      2342545
1   2       EU    Italy   2.0  DEF      5464545
2   3  America      USA   3.0  GHI      3245325
3   4     Asia    India   4.0  JKL       676565
4   5  America   Mexico   5.0  MNO      3443252
5   6     Asia    China   6.0  PQR       565445
6   7       EU  Germany   1.0  STU       765765
7   8  America   Mexico   5.0  VWX       564566
8   9     Asia    China   6.0  YZA       346534
9  10     Asia    India   4.0  BCD      5675765
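
Note that Code comes back as float (1.0, 2.0, ...) because the column originally held NaNs. If integer codes are needed, one option (not part of the original answer) is a nullable integer cast:

df['Code'] = df['Code'].astype('Int64')   # pandas nullable integer dtype, tolerates NaNs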

1 Comment

What do you mean by mapping?
