Pandas column string method on row functions

Question

I am trying to calculate a new column based on conditions of three other columns using string methods.

Sample data:

import numpy as np
import pandas as pd

d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

street1                 street2     city
1000 foo dr             city_a      city_a
1001 bar dr             NaN         city_b
1002 foo dr suite101    suite 101   NaN
1003 bar dr             suite 102   city_c

Ideal output:

Address
1000 foo dr
1001 bar dr
1002 foo dr suite 101
1003 bar dr suite 102

The idea here is:

if street2 matches city, ignore
if street2 matches the end of street1, ignore
otherwise, concatenate street1 and street2

What I tried:

def address_clean(row):
    if not row['street2']:
        return row['street1']
    if row['street2'] == row['city']:
        return row['street1']
    elif row['street1'].str.replace(' ', '').find(row['street2'].str.replace(' ', '')) != -1:
        return row['street1']
    else:
        return row['street1'] + row['street2']

d.apply(lambda row: address_clean(row), axis=1).head()

This throws an error:

AttributeError: ("'str' object has no attribute 'str'", 'occurred at index 1')

It seems that row['street1'] is a plain string rather than a pd.Series. I then removed the .str accessors from the original function, so it became:

def address_clean(row):
    if not row['street2']:
        return row['street1']
    if row['street2'] == row['city']:
        return row['street1']
    elif row['street1'].replace(' ', '').find(row['street2'].replace(' ', '')) != -1:
        return row['street1']
    else:
        return row['street1'] + row['street2']

d.apply(lambda row: address_clean(row), axis=1).head()

This version throws a different error:

AttributeError: ("'float' object has no attribute 'replace'", 'occurred at index 1')

Which part of the function am I using incorrectly, and how can I fix this error?

Your second error could be due to having NaN values; type(np.nan) gives float.
– G. Anderson, Commented Mar 25, 2019 at 21:13
Your expected output is weird. Where did suite 123 come from? And why was row 3 concatenated with row 4?
– Erfan, Commented Mar 25, 2019 at 21:19
@G.Anderson What I don't understand is, if the problem is having NaN values, wouldn't the if not row['street2']: return row['street1'] part of the function handle that properly? Why would it be evaluated in the following if statements?
– Xiaoyu Lu, Commented Mar 25, 2019 at 23:11
@Erfan Sorry, was an idiot. I edited the question and made the changes.
– Xiaoyu Lu, Commented Mar 25, 2019 at 23:11
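
On the question raised in the comments: np.nan is a float and is truthy, so `not row['street2']` evaluates to False for a missing value and the function falls through to the string methods. Below is a minimal sketch of the row function with an explicit pd.isna check, assuming the three rules listed in the question. Two further changes are assumptions rather than part of the original attempt: `endswith` replaces `find(...) != -1`, to follow the stated "matches the end of street1" rule, and a space is added when concatenating.

```python
import numpy as np
import pandas as pd

# NaN is a float and evaluates as truthy, so `not value` never catches it.
nan_is_truthy = bool(np.nan)
nan_is_float = isinstance(np.nan, float)


def address_clean(row):
    street1, street2, city = row['street1'], row['street2'], row['city']
    # Explicit missing-value check instead of `not street2`.
    if pd.isna(street2):
        return street1
    # Rule 1: street2 repeats the city.
    if street2 == city:
        return street1
    # Rule 2: street2 already ends street1, ignoring spaces (assumed reading).
    if street1.replace(' ', '').endswith(street2.replace(' ', '')):
        return street1
    # Rule 3: otherwise join the two parts with a space.
    return street1 + ' ' + street2


d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

result = d.apply(address_clean, axis=1)
print(result)
```

With the question's sample data, the third row keeps street1 exactly as written ('1002 foo dr suite101'), because the second rule says to ignore street2 in that case.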

Serge Ballesta · Accepted Answer · 2019-03-26 00:01:19Z

It is easy to search for a pattern in a Series, but I had to use apply to check whether one column ends with the content of another column. By the way, I had to change your data slightly, because '...suite101' does not end with 'suite 101' unless spaces are ignored. So I used:

d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite 101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

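# Keep street1 alone when street2 contains 'city' or is missing (na=True),
# or when street1 already ends with street2.
keep_street1 = (d.street2.str.contains('city', na=True)
                | d.apply(lambda x: x.street1.endswith(str(x.street2)), axis=1))
print(pd.DataFrame({'Address': np.where(keep_street1,
                                        d.street1,
                                        d.street1.str.cat(d.street2, sep=' '))}))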

gives as expected:

                 Address
0            1000 foo dr
1            1001 bar dr
2  1002 foo dr suite 101
3  1003 bar dr suite 102
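
The str.contains('city', na=True) test above works because the sample values happen to contain the text 'city'. Here is a sketch of a variant that compares street2 with the city column directly and ignores spaces in the suffix check, as the question describes. It runs on the question's original data (with 'suite101'), and the mask names (`same_as_city`, `ends_with_street2`, `keep_street1`) are illustrative rather than taken from the answer.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

# Rows where street2 is missing keep street1 unchanged.
street2_missing = d['street2'].isna()

# Rule 1: street2 repeats the city (comparisons against NaN are False).
same_as_city = d['street2'].eq(d['city'])

# Rule 2: street1 ends with street2 once spaces are removed.
s1 = d['street1'].str.replace(' ', '', regex=False)
s2 = d['street2'].fillna('').str.replace(' ', '', regex=False)
ends_with_street2 = pd.Series([a.endswith(b) for a, b in zip(s1, s2)],
                              index=d.index)

keep_street1 = street2_missing | same_as_city | ends_with_street2

# Rule 3: otherwise concatenate street1 and street2 with a space.
address = np.where(keep_street1,
                   d['street1'],
                   d['street1'].str.cat(d['street2'], sep=' '))

out = pd.DataFrame({'Address': address})
print(out)
```

The suffix check still goes through a Python-level loop, since pandas' str.endswith takes a single pattern rather than a per-row one; only the other two masks are vectorised.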
