Conditional Regex Function on Pandas Dataframe

Question

I have the following df and function (see below). I might be over complicating this. A new set of fresh eyes would be deeply appreciated.

df:

Site Name   Plan Unique ID  Atlas Placement ID
Affectv     we11080301      11087207850894
Mashable    we14880202      11087208009031
Alphr       uk10790301      11087208005229
Alphr       uk19350201      11087208005228

The goal is to:

Iter first through df['Plan Unique ID'], search for a specific value (we_match or uk_match), if there is a match
Check that the string value is bigger than a certain value in that group (we12720203 or uk11350200)
If the value is greater than add that we or uk value to a new column df['Consolidated ID'].
If the value is lower or there is no match, then search df['Atlas Placement ID'] with new_id_search
If there is a match, then add that to df['Consolidated ID']
If not, return 0 to df['Consolidated ID]

The current problem is that it returns an empty column.

 def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):

        if type(df['Plan Unique ID']) is str:
            we_match = re.search(we_search, df['Plan Unique ID'])
            if we_match:
                if we_match > "we12720203":
                    return we_match.group(0)
                else:
                    uk_match =  re.search(uk_search, df['Plan Unique ID'])
                    if uk_match:
                        if uk_match > "uk11350200":
                            return uk_match.group(0)
                        else:
                            match_new =  re.search(new_id_search, df['Atlas Placement ID'])
                            if match_new:
                                return match_new.group(0)

                            return 0


    mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)

Edit: Cleaned the formula

I modified gzl's function in the following way (see below): First see if in df1 there is 14 numbers. If so, add that.

The next step, ideally would be to grab a column MediaPlanUnique from df2 and turn it into a series filtered_placements:

And see if any of the values in filtered_placements are present in df['Plan Unique ID]. If there is a match, then add df['Plan Unique ID] to our end column = df[ConsolidatedID]

The current problem is that it results in all 0. I think it's because the comparison is been done as 1 to 1 (first result of new_match vs first result of filtered_placements) rather than 1 to many (first result of new_match vs all results of filtered_placements)

Any ideas?

def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search= "(\d{14})"):

    if type(df['PlacementID']) is str:

        old_match =  re.search(old_id_search, df['PlacementID'])
        if old_match:
            return old_match.group(0)

        else:

            if type(df['Plan Unique ID']) is str:
                if type(filtered_placements) is str:


                    new_match = re.search(new_id_search, df['Plan Unique ID'])
                    if new_match:
                        if filtered_placements.str.contains(new_match.group(0)):
                            return new_match.group(0)          


        return 0

mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)

Any chance you could supply a sample of the data @Matt so that we validate against it? — Phil Sheard
– Phil Sheard, Commented Apr 4, 2017 at 14:37
Thanks. Also, regarding the first 'if': is there a scenario where the 'Plan Unique ID' would be anything other than a string? i.e. is this just an error check or do you explicitly want to leave non-string values (e.g. integers) alone. — Phil Sheard
– Phil Sheard, Commented Apr 4, 2017 at 14:42
Thanks Matt. Logic does seem complicated so I'm looking at it from a different angle and making a few different passes, as each one is mutually-exclusive so no risk of overwriting it. Bear with me... — Phil Sheard
– Phil Sheard, Commented Apr 4, 2017 at 14:49
Hi Matt. If either of the answers solved your problem, would you be able to mark one as the answer so either I or @gzc get the result added to our profile. Thanks! — Phil Sheard
– Phil Sheard, Commented Apr 5, 2017 at 9:01

gzc · Accepted Answer · 2017-04-04 15:08:23Z

1

I would recommend that don't use such complicate nested if statements. As Phil pointed out, each check is mutually-exclusive. Thus you can check 'we' and 'uk' in same indented if statement, then fall back to default process.

def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):

    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match.group(0) > "we12720203":
                return we_match.group(0)

        uk_match =  re.search(uk_search, df['Plan Unique ID'])
        if uk_match:
            if uk_match.group(0) > "uk11350200":
                return uk_match.group(0)


        match_new =  re.search(new_id_search, df['Atlas Placement ID'])

        if match_new:
            return match_new.group(0)

        return 0

Test:

In [37]: df.apply(placement_extract, axis=1)
Out[37]:
0    11087207850894
1        we14880202
2    11087208005229
3        uk19350201
dtype: object

answered Apr 4, 2017 at 15:08

gzc

8,6998 gold badges45 silver badges63 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Matt M Over a year ago

Hi gzc, I added an extra variable in the edit. I think I identified the problem, but am stuck on how to solve it. Any help would be appreciated

Phil Sheard · Accepted Answer · 2017-04-04 15:49:59Z

I've reorganised the logic and also simpified the regex operations to show another way to approach it. The reorganisation wasn't strictly necessary for the answer but as you asked for another opinion / way of approaching it I thought this might help you in future:

# Inline comments to explain the main changes.
def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    # Extracted to shorter temp variable
    plan_id = row["Plan Unique ID"]
    # Using parenthesis to get two separate groups - code and numeric
    # Means you can do the match just once
    result = re.match("(we|uk)(.+)",plan_id)
    if result:
        code, numeric = result.groups()
        # We can get away with these simple tests as the earlier regex guarantees
        # that the string starts with either "we" or "uk"
        if code == "we" and plan_id > we_search:
            return_val = plan_id
        elif code == "uk" and plan_id > uk_search:
            return_val = plan_id
        else:
            # It looked like this column was used whatever happened at the
            # end, so there's no need to check against a regex
            #
            # The Atlas Placement is the default option if it either fails
            # the prefix check OR the "greater than" test
            return_val = row["Atlas Placement ID"]
    # A single return statement is often easier to debug
    return return_val

Then using in an apply statement (also look into assign):

$ mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
$ mediaplan_df
>   
Site Name Plan Unique ID Atlas Placement ID Consolidated ID
0   Affectv     we11080301     11087207850894  11087207850894
1  Mashable     we14880202     11087208009031      we14880202
2     Alphr     uk10790301     11087208005229  11087208005229
3     Alphr     uk19350201     11087208005228      uk19350201

You could probably get away with replacing the regex operations with simpler string operations (e.g. if plan_id.startswith("we"): etc) but that's another step beyond the scope of your question.

Collectives™ on Stack Overflow

Conditional Regex Function on Pandas Dataframe

2 Answers 2

1 Comment

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related