I am working with a dataset of used cars made available in a DataQuest guided project (https://www.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/). I have provided a sample of the data for this question.

What I am trying to do is remove redundant information from the car name, such as the brand name. The brand is already contained in another column, and the exercise is about data cleaning with pandas, so I want to see whether there is a clean way to replace such substrings using the library's own functionality. I tried passing a pandas Series as the pat argument of Series.str.replace(), but that does not work. What's a clean way to perform a vectorized replacement on a pandas Series based on another Series?

Ideally 'Peugeot_807_160_NAVTECH_ON_BOARD' would become '_807_160_NAVTECH_ON_BOARD' and so on.

import pandas as pd

autos_dict = {
    'brand': ['peugeot', 'bmw', 'volkswagen', 'smart', 'chrysler'],
    'name': [
        'Peugeot_807_160_NAVTECH_ON_BOARD',
        'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik',
        'Volkswagen_Golf_1.6_United',
        'Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama',
        'Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Stow´n_Go_Sitze_7Sitze'
    ]
}

autos_df = pd.DataFrame.from_dict(autos_dict)
autos_df['name'].str.replace(autos_df['brand'], '', case=False)

The following error message is returned:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 2429, in replace
    flags=flags, regex=regex)
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 656, in str_replace
    compiled = re.compile(pat, flags=flags)
  File "/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/anaconda3/lib/python3.6/re.py", line 289, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1489, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed

I already know how to do this in plain Python, so please respond only if you have a pandas-based solution.

3 Answers

You can do this row by row with DataFrame.apply. Note that this version lowercases the whole name before removing the brand:

In [6]: def replace_brand(row):
   ...:     return row['name'].lower().replace(row['brand'], '')
   ...: 

In [8]: autos_df['name'] = autos_df.apply(replace_brand, axis=1)

In [9]: autos_df
Out[9]: 
        brand                                               name
0     peugeot                          _807_160_navtech_on_board
1         bmw            _740i_4_4_liter_hamann_umbau_mega_optik
2  volkswagen                                   _golf_1.6_united
3       smart          __fortwo_coupe_softouch/f1/klima/panorama
4    chrysler  _grand_voyager_2.8_crd_aut.limited_stow´n_go_s...
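If the original capitalization should be kept, the same row-wise idea can be sketched with re.sub and a case-insensitive flag. This is only a sketch: strip_brand is a hypothetical helper name, the frame is a shortened copy of the sample data, and re.escape is added on the assumption that some brand names might contain regex metacharacters.

```python
import re
import pandas as pd

# Shortened copy of the sample data from the question.
autos_df = pd.DataFrame({
    'brand': ['peugeot', 'bmw', 'smart'],
    'name': [
        'Peugeot_807_160_NAVTECH_ON_BOARD',
        'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik',
        'Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama',
    ],
})

def strip_brand(row):
    # Hypothetical helper: remove this row's brand from this row's name,
    # ignoring case and leaving every other character untouched.
    return re.sub(re.escape(row['brand']), '', row['name'], flags=re.IGNORECASE)

cleaned = autos_df.apply(strip_brand, axis=1)
print(cleaned.tolist())
```

Like the answer above, this still calls a Python function once per row, so it is not vectorized in the strict sense.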

Without apply:

# map every brand to an empty replacement string; the keys are treated as regex patterns
r = {brand: '' for brand in autos_df['brand']}
autos_df['name'].str.lower().replace(r, regex=True)

Outputs

0                            _807_160_navtech_on_board
1              _740i_4_4_liter_hamann_umbau_mega_optik
2                                     _golf_1.6_united
3            __fortwo_coupe_softouch/f1/klima/panorama
4    _grand_voyager_2.8_crd_aut.limited_stow´n_go_s...
Name: name, dtype: object
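A related sketch that stays inside Series.str.replace: join all brands into one alternation pattern, so that pat is a single string rather than a Series. The names pattern and cleaned are illustrative, and the frame is a shortened copy of the sample data. One caveat applies here as well: every brand is removed from every row, not only the brand of that row.

```python
import re
import pandas as pd

# Shortened copy of the sample data from the question.
autos_df = pd.DataFrame({
    'brand': ['peugeot', 'bmw', 'smart'],
    'name': [
        'Peugeot_807_160_NAVTECH_ON_BOARD',
        'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik',
        'Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama',
    ],
})

# One scalar pattern such as 'peugeot|bmw|smart'; re.escape protects
# against brands that contain regex metacharacters.
pattern = '|'.join(re.escape(b) for b in autos_df['brand'].unique())

# case=False makes the match case-insensitive without lowercasing the names.
cleaned = autos_df['name'].str.replace(pattern, '', case=False, regex=True)
print(cleaned.tolist())
```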


You just need a regular expression that ignores case: the inline flag (?i) is the same as re.I. Since replace does not take a flags argument, you add the flag to the pattern yourself. This keeps all the other characters exactly as they were before the deletion, i.e. capitals stay capital and lowercase stays lowercase.

autos_df.name.replace(regex=r'(?i)' + autos_df.brand, value="")
Out[1726]: 
0                            _807_160_NAVTECH_ON_BOARD
1              _740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                                     _Golf_1.6_United
3            __fortwo_coupe_softouch/F1/Klima/Panorama
4    _Grand_Voyager_2.8_CRD_Aut.Limited_Stow´n_Go_S...
Name: name, dtype: object
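The equivalence between the inline (?i) flag and re.I can be checked with the standard re module alone. A minimal sketch on one name from the sample data (the variable names are illustrative):

```python
import re

name = 'Peugeot_807_160_NAVTECH_ON_BOARD'

# Inline flag inside the pattern...
inline = re.sub(r'(?i)peugeot', '', name)
# ...versus the same flag passed as an argument.
flagged = re.sub(r'peugeot', '', name, flags=re.I)

# Both remove the brand and leave the rest of the string unchanged.
print(inline, flagged)
```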

1 Comment

This is the best solution ;) + 1
