I am working with fairly messy data, a tariff table of the following form:

import pandas as pd
import numpy as np

data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
data2 = np.array(['10101010', '10101020'])
data = {'hscode': data2, 'tariff' : data1}

df = pd.DataFrame(data, columns=['hscode', 'tariff'])

The first row shows that the tariff is zero for countries A, B, KR and FR and 5% for JP; the second row shows that it is zero for A, B and FR and 5% for JP and KR.

I want to find the tariff rate of country 'KR' for each row, so that I end up with the following table:

hscode    tariff
10101010  0%
10101020  5%

So, I want to find the tariff rate for the country code 'KR' in each cell.

  • Can you explain more clearly how data2 is related to data1, and what the relationship is between KR and (A, B, KR, FR)? Commented Oct 16, 2015 at 12:37
  • Hi Anzel, data2 is the 'harmonized tariff code' and data1 shows the actual tariff rate for each country. A, B, KR, FR and JP all denote countries, and I want to find the tariff rate for a specific country, KR. Thanks. Commented Oct 16, 2015 at 12:54
  • I just posted an answer that does not use regular expressions. Are regular expressions mandatory? Because you just state them in the title. Commented Oct 16, 2015 at 13:02
  • Thanks, Fabian. I am trying to study re, as I encounter this kind of messy text data frequently. Commented Oct 16, 2015 at 14:26

2 Answers


You may use apply with regex:

In [133]: import re

In [134]: df
Out[134]: 
     hscode                         tariff
0  10101010   Free (A, B, KR, FR), 5% (JP)
1  10101020  Free (A, B, FR), 5% (JP, KR))

In [135]: df['tariff'].apply(lambda x: ''.join(re.findall(r'.*(Free|\d+%).*\bKR\b', x)))
Out[135]: 
0    Free
1      5%
Name: tariff, dtype: object

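Explanation: within each tariff string, capture either "Free" or "x%" when it is followed somewhere later in the string by "KR" as a whole word. Because the leading .* is greedy, the captured rate is the last one that appears before "KR".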

You may create a function to dynamically set "KR" as a lookup variable.
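A minimal sketch of such a function follows. The name `tariff_for` and the pattern are assumptions of this sketch, not part of the answer above: instead of the `.*` one-liner, it pairs each rate with its own parenthesised country list. One reason for the change is that in `.*(Free|\d+%)` the greedy `.*` can swallow leading digits, so a rate like "15%" would be captured as "5%".

```python
import re

import pandas as pd

# Hypothetical pattern: a rate ("Free" or a percentage) followed by "(...)".
RATE_GROUP = re.compile(r'(Free|\d+(?:\.\d+)?%)\s*\(([^)]*)\)')


def tariff_for(text, country):
    """Return the rate whose country list contains `country`, else None."""
    # Each match is a (rate, countries) pair such as ('Free', 'A, B, KR, FR').
    for rate, countries in RATE_GROUP.findall(text):
        if country in [c.strip() for c in countries.split(',')]:
            return rate
    return None  # country not listed in this cell


df = pd.DataFrame({
    'hscode': ['10101010', '10101020'],
    'tariff': ['Free (A, B, KR, FR), 5% (JP)', 'Free (A, B, FR), 5% (JP, KR))'],
})

kr = df['tariff'].apply(lambda cell: tariff_for(cell, 'KR'))
print(kr.tolist())
```

Swapping 'KR' for another code, for example `tariff_for(cell, 'JP')`, looks up a different country without touching the pattern.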


5 Comments

Anzel, you are my angel! This works very well, but I still need to study regular expressions. I don't understand how this code works. It looks like it is trying to find (Free|\d+%) in x, but I don't know what the two stars (*) and the dot (.) do in this expression. I also don't know how this code handles the parentheses. But thanks so much! I will study this more!!
@JohnShin Not a problem :) The .* (dot star) means zero or more of any character (which is anything, really), and \b is a word boundary, so \bKR\b only matches "KR" as a whole word, not as part of a longer one.
@JohnShin The regex I used is basically a pattern saying: with or without anything in front (the .* outside the brackets, which I am not capturing), there has to be either "Free" or "x%" (inside the brackets, because I want to capture that value), then again anything in between, and then the string it must contain, i.e. "KR". So in the end you only capture (...), the thing in brackets.
Thank you for your explanations. It looks like regular expressions are more powerful than I imagined. Thank you so much!!
@JohnShin, they are, but always try to solve a problem in the simplest way you can think of before moving to regex. Sometimes a simple if/then will do just fine, and you can get rid of unnecessary overhead.
    import pandas as pd
    import numpy as np

    data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
    data2 = np.array(['10101010', '10101020'])

    df = []
    for i, element in enumerate(data1):
        free, five = element.lstrip('Free (').rstrip(')').split('), 5% (')
        for country in free.split(', '):
            row = [data2[i], 'Free', country]
            df.append(row)
        for country in five.split(', '):
            row = [data2[i], '5%', country]
            df.append(row)
    df = pd.DataFrame(df, columns = ['hscode', 'tariff', 'country'])
    print(df.query('country == "KR"'))  # print() works on both Python 2 and 3

gives

     hscode tariff country
2  10101010   Free      KR
9  10101020     5%      KR

1 Comment

Thanks for the answer. But what if there are multiple tariff rates, i.e., it could be 4%, 5%, 1%, etc.? I have more than 20,000 rows. Thanks.
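One way to handle arbitrary rates is sketched below. It swaps the hard-coded 'Free'/'5%' split for a regular expression with pandas string methods, and it assumes every rate is written as "Free" or a percentage followed by a parenthesised, comma-separated country list. The third row and the names `pattern`, `pairs` and `long_df` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'hscode': ['10101010', '10101020', '10101030'],
    'tariff': ['Free (A, B, KR, FR), 5% (JP)',
               'Free (A, B, FR), 5% (JP, KR))',
               '1% (A), 4% (KR), 12.5% (JP)'],  # made-up row with other rates
})

# Assumed cell format: "<rate> (<country>, <country>, ...)", repeated.
pattern = r'(?P<rate>Free|\d+(?:\.\d+)?%)\s*\((?P<countries>[^)]*)\)'

# One row per (rate, country list) pair; level 0 of the index is the
# original row label, level 'match' numbers the pairs within a cell.
pairs = df['tariff'].str.extractall(pattern)
pairs = pairs.reset_index(level='match', drop=True)
pairs = pairs.join(df['hscode']).reset_index(drop=True)

# One row per (hscode, rate, country).
long_df = pairs.assign(country=pairs['countries'].str.split(',')).explode('country')
long_df['country'] = long_df['country'].str.strip()
long_df = long_df[['hscode', 'rate', 'country']].reset_index(drop=True)

kr = long_df[long_df['country'] == 'KR']
print(kr)
```

The long table is built once, so looking up another country afterwards is only a filter on the `country` column.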
