I am trying to calculate a new column based on conditions of three other columns using string methods.

Sample data:

import numpy as np
import pandas as pd

d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

street1                 street2     city
1000 foo dr             city_a      city_a
1001 bar dr             NaN         city_b
1002 foo dr suite101    suite 101   NaN
1003 bar dr             suite 102   city_c

Ideal output:

Address
1000 foo dr
1001 bar dr
1002 foo dr suite 101
1003 bar dr suite 102

The idea here is

  • if street2 matches city, ignore
  • if street2 matches the end of street1, ignore
  • otherwise, concatenate street1 and street2

What I tried:

def address_clean(row):
    if not row['street2']:
        return row['street1']
    if row['street2'] == row['city']:
        return row['street1']
    elif row['street1'].str.replace(' ', '').find(row['street2'].str.replace(' ', '')) != -1:
        return row['street1']
    else:
        return row['street1'] + row['street2']

d.apply(lambda row: address_clean(row), axis=1).head()

This throws an error:

AttributeError: ("'str' object has no attribute 'str'", 'occurred at index 1')

It seems that row['street1'] is a plain string rather than a pd.Series. However, even after I remove the .str accessors, so that the function becomes:

def address_clean(row):
    if not row['street2']:
        return row['street1']
    if row['street2'] == row['city']:
        return row['street1']
    elif row['street1'].replace(' ', '').find(row['street2'].replace(' ', '')) != -1:
        return row['street1']
    else:
        return row['street1'] + row['street2']

d.apply(lambda row: address_clean(row), axis=1).head()

The code throws the following error:

AttributeError: ("'float' object has no attribute 'replace'", 'occurred at index 1')

I am wondering which part of the function I am using incorrectly, and how to fix this error.

  • Your second error could be due to having NaN values: type(np.nan) gives float. Commented Mar 25, 2019 at 21:13
  • Your expected output is weird. Where did suite 123 come from? And why was row 3 concatenated with row 4? Commented Mar 25, 2019 at 21:19
  • @G.Anderson What I don't understand is, if the problem is having NaN values, wouldn't the if not row['street2']: return row['street1'] part of the function handle that properly? Why would it be evaluated in the following if statements? Commented Mar 25, 2019 at 23:11
  • @Erfan Sorry, was an idiot. I edited the question and made the changes. Commented Mar 25, 2019 at 23:11
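
On the question raised in the comments about why `if not row['street2']` does not catch the missing value: NaN is a nonzero float, so Python treats it as truthy and `not np.nan` evaluates to False. The short check below illustrates this and shows `pd.isna` as one explicit alternative; the variable name `value` is only for illustration.

```python
import numpy as np
import pandas as pd

value = np.nan  # what row['street2'] holds for the row with the missing entry

# NaN is a float that is not equal to zero, so it counts as truthy.
is_truthy = bool(value)        # True
passes_not_check = not value   # False, so the early return is skipped

# An explicit missing-value test is needed instead.
is_missing = pd.isna(value)    # True

print(is_truthy, passes_not_check, is_missing)
```

Because the early return is skipped, execution reaches the `.replace` call on a float, which is where the second AttributeError comes from.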

1 Answer

It is easy to search for a pattern in a Series, but I had to use apply to check whether one column ends with the content of another column. By the way, I had to change your data slightly, because '...suite101' does not end with 'suite 101' unless spaces are ignored. So I used:

import numpy as np
import pandas as pd

d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite 101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

print(pd.DataFrame({'Address': np.where(d.street2.str.contains('city', na=True)
               | d.apply(lambda x: x.street1.endswith(str(x.street2)), axis = 1),
               d.street1,
               d.street1.str.cat(d.street2, sep=' '))}))

gives as expected:

                 Address
0            1000 foo dr
1            1001 bar dr
2  1002 foo dr suite 101
3  1003 bar dr suite 102
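
For completeness, here is a minimal sketch of a row-wise variant that stays closer to the function in the question. It is one possible fix, not the only one. It assumes the question's original data (with 'suite101' written without a space), tests for missing values with `pd.isna`, compares street2 against the row's own city value rather than the literal text 'city', and ignores spaces when checking the end of street1.

```python
import numpy as np
import pandas as pd

# Original sample data from the question ('suite101' has no space).
d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

def address_clean(row):
    street1, street2 = row['street1'], row['street2']
    # A missing street2 must be tested explicitly; `not street2` is False for NaN.
    if pd.isna(street2):
        return street1
    # street2 merely repeats the city: ignore it.
    if street2 == row['city']:
        return street1
    # Compare with spaces removed so 'suite101' matches 'suite 101'.
    if street1.replace(' ', '').endswith(street2.replace(' ', '')):
        return street1
    # Otherwise join the two parts with a single space.
    return street1 + ' ' + street2

d['Address'] = d.apply(address_clean, axis=1)
print(d['Address'])
```

With this data, the third row keeps street1 unchanged ('1002 foo dr suite101'), because the rule says to ignore street2 when it already appears at the end of street1. Turning that into 'suite 101' would need a separate normalization step.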