Regular expression to filter desired rows from pandas dataframe

Question

I am working with fairly messy data: a tariff table with the following form:

import pandas as pd
import numpy as np

data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
data2 = np.array(['10101010', '10101020'])
data = {'hscode': data2, 'tariff' : data1}

df = pd.DataFrame(data, columns=['hscode', 'tariff'])

The first row shows that the tariff is zero for countries A, B, KR, and FR and 5% for JP. The second row shows that it is zero for A, B, and FR but 5% for JP and KR.

I want to find the tariff rate of country 'KR' for each row, so that I could have the following table:

'hscode' 'tariff'

10101010 0%

10101020 5%

So, I want to find the tariff rate for the country code 'KR' in each cell.

Can you explain more clearly how data2 is related to data1, and what the relationship is between KR and (A, B, KR, FR)?
– Anzel, Commented Oct 16, 2015 at 12:37
Hi Anzel, data2 is the 'harmonized tariff code' and data1 shows the actual tariff rate for each country. (A, B, KR, FR, JP) all denote countries, and I want to find the tariff rate for a specific country, KR. Thanks.
– John Shin, Commented Oct 16, 2015 at 12:54
I just posted an answer that does not use regular expressions. Are regular expressions mandatory? You only mention them in the title.
– Fabian Rost, Commented Oct 16, 2015 at 13:02
Thanks, Fabian. I am trying to study re, as I encounter this kind of messy text data frequently.
– John Shin, Commented Oct 16, 2015 at 14:26

Anzel · Accepted Answer · 2015-10-16 13:31:27Z

2

You may use apply with regex:


In [133]: import re

In [134]: df
Out[134]: 
     hscode                         tariff
0  10101010   Free (A, B, KR, FR), 5% (JP)
1  10101020  Free (A, B, FR), 5% (JP, KR))

In [135]: df['tariff'].apply(lambda x: ''.join(re.findall(r'.*(Free|\d+%).*\bKR\b', x)))
Out[135]: 
0    Free
1      5%
Name: tariff, dtype: object

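Explanation: within each tariff string, capture the "Free" or "x%" that precedes "KR"; if the string does not contain "KR", nothing is captured and the result is an empty string.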

You may create a function to dynamically set "KR" as a lookup variable.
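A minimal sketch of such a function follows. The name `tariff_for` and the tightened pattern are assumptions for illustration, not part of the original answer: instead of `.*`, the pattern uses `[^()]*` so the country code must sit inside the same pair of parentheses as the captured rate, and `re.escape` guards against special characters in the lookup value.

```python
import re

import pandas as pd


def tariff_for(text, country):
    """Return 'Free' or a rate such as '5%' for `country`, or None if absent.

    Hypothetical helper: the name and the exact pattern are illustrative.
    """
    # <rate> (<codes without parentheses> <country as a whole word>
    pattern = (r'(Free|\d+(?:\.\d+)?%)\s*\([^()]*\b'
               + re.escape(country) + r'\b')
    match = re.search(pattern, text)
    return match.group(1) if match else None


df = pd.DataFrame({
    'hscode': ['10101010', '10101020'],
    'tariff': ['Free (A, B, KR, FR), 5% (JP)',
               'Free (A, B, FR), 5% (JP, KR))'],
})

# The country code is now an ordinary argument rather than part of the regex.
result = df['tariff'].apply(lambda s: tariff_for(s, 'KR'))
print(result.tolist())
```

Because `re.search` returns the leftmost match, a multi-digit rate such as `15%` is captured whole rather than truncated to its last digit.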

answered Oct 16, 2015 at 13:31

Anzel


5 Comments

John Shin Over a year ago

Anzel, you are my angel! This works very well, but I still need to study regular expressions. I don't understand how this code works. It looks like it is trying to find (Free|\d+%) in x, but I don't know what the two stars (*) and the dot (.) do in this expression. I also don't know how this code handles the parentheses. But thanks so much! I will study this more!!

Anzel Over a year ago

@JohnShin Not a problem :) The .* (dot star) means zero or more of any character (anything, really), and \b is a word boundary, so \b[string]\b means the text must contain [string] as a whole word rather than as part of a longer word.

Anzel Over a year ago

@JohnShin The regex I used is a pattern that says: anything (or nothing) may come first, which is the leading .* outside the parentheses (I am not capturing it); next comes either "Free" or "x%" inside the parentheses (this is the value I want to capture); then anything may sit between that value and the string it must contain, i.e. "KR". So in the end you capture only (...), the part in parentheses.

John Shin Over a year ago

Thank you for your explanations. It looks like regular expression is more powerful than I imagined. Thank you so much!!

Anzel Over a year ago

@JohnShin, it is, but always try to solve a problem in the simplest way you can think of before moving to regex. Sometimes a simple if/then will do just fine, and you can get rid of unnecessary overhead.
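The pieces of the pattern discussed in these comments can be checked one at a time. The snippet below is an illustrative sketch using only the standard `re` module; the sample strings other than the question's second row are made up for the demonstration. It also shows a caveat that the thread does not cover: the leading greedy `.*` can swallow the first digit of a multi-digit rate.

```python
import re

text = 'Free (A, B, FR), 5% (JP, KR))'
pattern = r'.*(Free|\d+%).*\bKR\b'

# \b is a zero-width word boundary: "KR" must not be glued to other
# letters, digits, or underscores.
whole_word = re.search(r'\bKR\b', text) is not None
inside_word = re.search(r'\bKR\b', 'Free (KRW)') is not None
print(whole_word, inside_word)

# findall returns only what the parentheses captured, not the whole match.
captured = re.findall(pattern, text)
print(captured)

# Caveat: the greedy leading .* backtracks only as far as it must, so with
# a two-digit rate it keeps the "1" and the group captures just "5%".
truncated = re.findall(pattern, 'Free (A), 15% (KR)')
print(truncated)

# Anchoring the digits with \b forces the capture to start at the first digit.
fixed = re.findall(r'.*(Free|\b\d+%).*\bKR\b', 'Free (A), 15% (KR)')
print(fixed)
```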

Fabian Rost · Answer · 2015-10-16 13:00:22Z

0

    import pandas as pd
    import numpy as np

    data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
    data2 = np.array(['10101010', '10101020'])

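    # Build one row per (hscode, tariff, country) combination.
    df = []
    for i, element in enumerate(data1):
        # Remove the leading 'Free (' and any trailing ')' characters, then
        # split the free-rate countries from the 5% countries.
        # Note: lstrip strips a set of characters, not a prefix, so this
        # relies on the first country code not starting with F, r, or e.
        free, five = element.lstrip('Free (').rstrip(')').split('), 5% (')
        for country in free.split(', '):
            row = [data2[i], 'Free', country]
            df.append(row)
        for country in five.split(', '):
            row = [data2[i], '5%', country]
            df.append(row)
    df = pd.DataFrame(df, columns=['hscode', 'tariff', 'country'])
    # Keep only the rows for the country of interest.
    print(df.query('country == "KR"'))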

gives

     hscode tariff country
2  10101010   Free      KR
9  10101020     5%      KR

answered Oct 16, 2015 at 13:00

Fabian Rost


1 Comment

John Shin Over a year ago

Thanks for the answer. But what if there are multiple tariff rates, i.e., it could be 4%, 5%, 1%, etc.? I have more than 20,000 rows. Thanks.
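One way to extend the long-format idea to arbitrary rates is sketched below. This is not from either answer: it assumes every cell follows the shape `<rate> (<comma-separated codes>)`, repeated, where the rate is either `Free` or a percentage. The third row and the name `GROUP` are hypothetical, added only to show several different rates in one cell.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'hscode': ['10101010', '10101020', '10101030'],
    'tariff': ['Free (A, B, KR, FR), 5% (JP)',
               'Free (A, B, FR), 5% (JP, KR))',
               '1% (A), 4% (KR, JP), 10% (FR)'],  # hypothetical extra row
})

# One match per "<rate> (<codes>)" group; [^()]* stays inside one pair of
# parentheses, so the stray ")" in the second row is simply ignored.
GROUP = re.compile(r'(Free|\d+(?:\.\d+)?%)\s*\(([^()]*)\)')

rows = []
for hscode, text in zip(df['hscode'], df['tariff']):
    for rate, codes in GROUP.findall(text):
        for country in codes.split(','):
            rows.append((hscode, rate, country.strip()))

# Long format: one row per (hscode, tariff, country).
long_df = pd.DataFrame(rows, columns=['hscode', 'tariff', 'country'])
kr = long_df[long_df['country'] == 'KR']
print(kr)
```

The loop makes a single pass over the cells, and once the long table exists, the rate for any country is an ordinary filter with no further regex work.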
