I am working with a dataset of used cars made available in a DataQuest guided project (https://www.dataquest.io/m/294/guided-project%3A-exploring-ebay-car-sales-data/). I have provided a sample of the data for this question.
What I am trying to do is remove redundant information, such as the brand name, from the car name. The brand is already contained in another column, and since the exercise is about data cleaning with pandas, I want to see if there is a clean way to remove such substrings using the library's own functionality. I tried passing a pandas Series as the pat argument of Series.str.replace(), but that fails because pat has to be a string or a compiled regex. What's a clean way to perform a vectorized replacement on a pandas Series based on another Series?
Ideally 'Peugeot_807_160_NAVTECH_ON_BOARD' would become '_807_160_NAVTECH_ON_BOARD' and so on.
import pandas as pd
autos_dict = {
    'brand': ['peugeot', 'bmw', 'volkswagen', 'smart', 'chrysler'],
    'name': [
        'Peugeot_807_160_NAVTECH_ON_BOARD',
        'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik',
        'Volkswagen_Golf_1.6_United',
        'Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama',
        'Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Stow´n_Go_Sitze_7Sitze'
    ]
}
autos_df = pd.DataFrame.from_dict(autos_dict)
autos_df['name'].str.replace(autos_df['brand'], '', case=False)
The following error message is returned:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 2429, in replace
    flags=flags, regex=regex)
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 656, in str_replace
    compiled = re.compile(pat, flags=flags)
  File "/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/anaconda3/lib/python3.6/re.py", line 289, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1489, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I already know how to do this in plain Python, so please respond only if you have a pandas-based solution.
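For reference, one direction that might work is to collapse the brand column into a single regex alternation, so that pat is a plain string again. This is only a minimal sketch under some assumptions: the brand always appears at the start of the name, and it is acceptable to strip any brand listed in the column rather than strictly the brand on the same row. The names brand_pattern and stripped are just illustrative.

```python
import re

import pandas as pd

autos_df = pd.DataFrame({
    'brand': ['peugeot', 'bmw', 'volkswagen'],
    'name': [
        'Peugeot_807_160_NAVTECH_ON_BOARD',
        'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik',
        'Volkswagen_Golf_1.6_United',
    ],
})

# Build one alternation from the distinct brands, escaped so that any regex
# metacharacters in a brand name are matched literally. The leading ^ anchors
# the match to the start of the name (an assumption about the data).
brand_pattern = '^(?:' + '|'.join(
    re.escape(brand) for brand in autos_df['brand'].unique()
) + ')'

# pat is now a single string, which Series.str.replace accepts.
stripped = autos_df['name'].str.replace(
    brand_pattern, '', case=False, regex=True
)
print(stripped.tolist())
```

On this sample the result matches the desired '_807_160_NAVTECH_ON_BOARD' form, but it is not a true row-by-row replacement: a name starting with a different listed brand than the one in its own row would be stripped as well.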