pandas.Series.str.replace() based on another series

Question

I am working with a dataset of used cars made available in a DataQuest guided project (https://www.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/). I have provided a sample of the data for this question.

What I am trying to do is remove redundant information, such as the brand name, from the car name. The brand is already stored in another column, and since the exercise is about data cleaning with pandas, I want to see whether the library offers a clean way to strip such substrings. I tried passing a pandas Series as the pat argument of Series.str.replace(), but that fails with the error shown below. What's a clean way to perform a vectorized replacement on a pandas Series based on another Series?

Ideally 'Peugeot_807_160_NAVTECH_ON_BOARD' would become '_807_160_NAVTECH_ON_BOARD' and so on.

import pandas as pd

autos_dict = {
    'brand': ['peugeot', 'bmw', 'volkswagen', 'smart', 'chrysler'],
    'name': [
        'Peugeot_807_160_NAVTECH_ON_BOARD',
        'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik',
        'Volkswagen_Golf_1.6_United',
        'Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama',
        'Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Stow´n_Go_Sitze_7Sitze'
    ]
}

autos_df = pd.DataFrame.from_dict(autos_dict)
autos_df['name'].str.replace(autos_df['brand'], '', case=False)

The following error message is returned:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 2429, in replace
    flags=flags, regex=regex)
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 656, in str_replace
    compiled = re.compile(pat, flags=flags)
  File "/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/anaconda3/lib/python3.6/re.py", line 289, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1489, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed

I already know how to do this in plain Python, so please respond only if you have a pandas-based solution.
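For context on the traceback: its last frames show re.compile looking the pattern up in a cache, which requires hashing it, and a Series refuses to be hashed. The sketch below reproduces only that hashing step; `brands` and `hashable` are illustrative names, and newer pandas versions may reject a Series pattern earlier with a different message.

```python
import pandas as pd

# Illustrative data; any Series behaves the same way here.
brands = pd.Series(['peugeot', 'bmw'])

try:
    # re.compile effectively does this when it checks its pattern cache.
    hash(brands)
    hashable = True
except TypeError:
    # pandas objects are mutable, so they refuse to be hashed.
    hashable = False

print(hashable)
```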

Ashish Acharya · Accepted Answer · 2018-08-15 00:09:30Z

2

You can do this row by row with apply. Note that this version lowercases the whole name before removing the brand:

In [6]: def replace_brand(row):
   ...:     return row['name'].lower().replace(row['brand'], '')
   ...: 

In [8]: autos_df['name'] = autos_df.apply(replace_brand, axis=1)

In [9]: autos_df
Out[9]: 
        brand                                               name
0     peugeot                          _807_160_navtech_on_board
1         bmw            _740i_4_4_liter_hamann_umbau_mega_optik
2  volkswagen                                   _golf_1.6_united
3       smart          __fortwo_coupe_softouch/f1/klima/panorama
4    chrysler  _grand_voyager_2.8_crd_aut.limited_stow´n_go_s...
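If the original casing of the name should survive, the same row-wise idea can be combined with a case-insensitive regular expression. This is only a sketch under that assumption: `strip_brand` and the `name_clean` column are hypothetical names, and the frame is a shortened copy of the sample data from the question.

```python
import re

import pandas as pd

autos_df = pd.DataFrame({
    'brand': ['peugeot', 'bmw', 'smart'],
    'name': [
        'Peugeot_807_160_NAVTECH_ON_BOARD',
        'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik',
        'Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama',
    ],
})


def strip_brand(row):
    # re.escape keeps brands with regex metacharacters from being
    # interpreted as patterns; IGNORECASE matches 'BMW' against 'bmw'.
    return re.sub(re.escape(row['brand']), '', row['name'], flags=re.IGNORECASE)


# Each row only loses its own brand; the rest of the name keeps its casing.
autos_df['name_clean'] = autos_df.apply(strip_brand, axis=1)
print(autos_df['name_clean'].tolist())
```

Like the apply call above, this still runs a Python function once per row, so it is not vectorized.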

rafaelc · Accepted Answer · 2018-08-15 00:10:17Z

1

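Without apply. Note that this removes every brand from every row, not only the row's own brand:

# Map each brand to the empty string; with regex=True the keys are treated as patterns.
r = {brand: '' for brand in autos_df['brand']}
autos_df['name'].str.lower().replace(r, regex=True)

Output: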

0                            _807_160_navtech_on_board
1              _740i_4_4_liter_hamann_umbau_mega_optik
2                                     _golf_1.6_united
3            __fortwo_coupe_softouch/f1/klima/panorama
4    _grand_voyager_2.8_crd_aut.limited_stow´n_go_s...
Name: name, dtype: object
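Because the dictionary is applied to the whole column, a name that mentions another row's brand loses that text too. The sketch below makes this explicit with a single alternation pattern passed to Series.str.replace; the two-row frame is hypothetical data invented for the illustration, and `pattern` and `result` are illustrative names.

```python
import re

import pandas as pd

# Hypothetical rows: the first name also contains the second row's brand.
df = pd.DataFrame({
    'brand': ['bmw', 'smart'],
    'name': ['BMW_smart_key_edition', 'Smart_fortwo'],
})

# One pattern matching any brand, escaped so brands are taken literally.
pattern = '|'.join(re.escape(b) for b in df['brand'].unique())

# case=False preserves the casing of everything that is not removed.
result = df['name'].str.replace(pattern, '', case=False, regex=True)
print(result.tolist())
```

Here the first row loses both 'BMW' and 'smart', which may or may not be what is wanted; a strictly per-row replacement needs the row's own brand as the pattern.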


Onyambu · Accepted Answer · 2018-08-15 00:24:45Z

1

You just need a regular expression that ignores case: the inline flag (?i) is equivalent to re.I. Since Series.replace does not take a flags argument, you have to put the flag into the pattern yourself. All other characters keep their original case, i.e. capitals stay capital and lowercase stays lowercase.

autos_df.name.replace(regex=r'(?i)' + autos_df.brand, value="")
Out[1726]: 
0                            _807_160_NAVTECH_ON_BOARD
1              _740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                                     _Golf_1.6_United
3            __fortwo_coupe_softouch/F1/Klima/Panorama
4    _Grand_Voyager_2.8_CRD_Aut.Limited_Stow´n_Go_S...
Name: name, dtype: object

1 Comment

rafaelc Over a year ago

This is the best solution ;) + 1
