3

I have a dataframe mixed state and region together. Those value have [edit] means states in US.

    RegionName
0   Alabama[edit]
1   Auburn [1]
2   Florence
3   Jacksonville [2]
4   Livingston [2]
5   Montevallo [2]
6   Troy [2]
7   Tuscaloosa [3][4]
8   Tuskegee [5]
9   Alaska[edit]    

The result I want is

    State               RegionName
0   Alabama[edit]       Auburn[1]
1                       Florence
2                       Jacksonville [2]
3                          ...
4   Alaska[edit]           ...   

I tried to use the code below but it failed

for row in df.iterrows():
    if row['RegionName'][-6:] == '[edit]':
        row['state'] = row[:-6]

The error message is

TypeError: tuple indices must be integers or slices, not str

Could someone give me some advice? Thank

1
  • Oh!! Thank you very much. It really helps a lot. Commented Feb 8, 2017 at 9:42

1 Answer 1

3

You can use mask, for select last six chars indexing with str:

mask = df.RegionName.str[-6:] != '[edit]'
print (mask)
0    False
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: RegionName, dtype: bool

#filter by mask and replace NaN by forward filling
df['State'] = df.RegionName.mask(mask).ffill()
#remove same values in both columns
df = df[df.State != df.RegionName]
print (df)
          RegionName          State
1         Auburn [1]  Alabama[edit]
2           Florence  Alabama[edit]
3   Jacksonville [2]  Alabama[edit]
4     Livingston [2]  Alabama[edit]
5     Montevallo [2]  Alabama[edit]
6           Troy [2]  Alabama[edit]
7  Tuscaloosa [3][4]  Alabama[edit]
8       Tuskegee [5]  Alabama[edit]
#keep only first duplicates, another replace by empty string
df['State'] = df.State.mask(df.State.duplicated(), '')
#change order of columns
df = df[['State','RegionName']].reset_index(drop=True)
print (df)
           State         RegionName
0  Alabama[edit]         Auburn [1]
1                          Florence
2                  Jacksonville [2]
3                    Livingston [2]
4                    Montevallo [2]
5                          Troy [2]
6                 Tuscaloosa [3][4]
7                      Tuskegee [5]

But if need remove [] and numbers is possible use a bit modified answer:

df.insert(0, 'State', df['RegionName'].str.extract('(.*)\[edit\]', expand=False).ffill())
df = df[~df['RegionName'].str.contains('\[edit\]')].reset_index(drop=True)
#change ( to [
df['RegionName'] = df['RegionName'].str.replace(r' \[.+$', '')
print (df)
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo
5  Alabama          Troy
6  Alabama    Tuscaloosa
7  Alabama      Tuskegee

df['State'] = df.State.mask(df.State.duplicated(), '')
print (df)
     State    RegionName
0  Alabama        Auburn
1               Florence
2           Jacksonville
3             Livingston
4             Montevallo
5                   Troy
6             Tuscaloosa
7               Tuskegee

EDIT by comment:

There are multiple problems, if need very slow loop solution:

#add i for index value else get tuples
for i, row in df.iterrows():
    print (row)
    if row['RegionName'][-6:] == '[edit]':
        #for appending new column with values use loc 
        df.loc[i, 'state'] = row['RegionName'][:-6]

print (df)
         RegionName    state
0     Alabama[edit]  Alabama
1        Auburn [1]      NaN
2          Florence      NaN
3  Jacksonville [2]      NaN
4    Livingston [2]      NaN
5    Montevallo [2]      NaN
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you very much! It really help. Also, do you have any idea about my error message?
Please check edited answer.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.