I have a dataframe df and it looks like this:

         id                        Type                        agent_id  created_at
0       44525   Stunning 6 bedroom villa in New Delhi               184  2018-03-09
1       44859   Villa for sale in Amritsar                          182  2017-02-19
2       45465   House in Faridabad                                  154  2017-04-17
3       50685   5 Hectre land near New Delhi                        113  2017-09-01
4      130728   Duplex in Mumbai                                    157  2017-02-07
5      130856   Large plot with fantastic views in Mumbai           137  2018-01-16
6      130857   Modern Design Penthouse in Bangalore                199  2017-03-24

I'm trying to clean this tabular data by extracting keywords from the Type column and creating a new dataframe with the resulting columns. The keyword lists are:

Apartment  = ['apartment', 'penthouse', 'duplex']
House      = ['house', 'villa', 'country estate']
Plot       = ['plot', 'land']
Location   = ['New Delhi','Mumbai','Bangalore','Amritsar']

The desired dataframe should look like this:

         id      Type        Location    agent_id  created_at
0       44525   House       New Delhi         184  2018-03-09
1       44859   House        Amritsar         182  2017-02-19
2       45465   House       Faridabad         154  2017-04-17
3       50685   Plot        New Delhi         113  2017-09-01
4      130728   Apartment      Mumbai         157  2017-02-07
5      130856   Plot           Mumbai         137  2018-01-16
6      130857   Apartment   Bangalore         199  2017-03-24

So far I've tried this:

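import pandas as pd
df = pd.read_csv('test_data.csv')

# I can extract these keywords one by one using for loops, but how
# can I do this in pandas with as few lines of code as possible?

# The column is named 'Type'; Series.items() replaces the removed iteritems().
for index, value in df['Type'].items():
    for keyword in Apartment:
        # The keyword lists are lowercase, so compare against the lowercased text.
        if keyword in value.lower():
            print(keyword)

df_new = pd.DataFrame(df['id'])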

Can someone tell me how to solve this?

2 Answers

First create the Location column with str.extract, joining the names with | (regex OR):

pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('('+ pat + ')', expand=False)
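
To see what the word-boundary pattern does on its own, here is a minimal self-contained sketch. The three rows are copied from the question; the small `sample` frame is only an illustration and stands in for the real `df`.

```python
import pandas as pd

Location = ['New Delhi', 'Mumbai', 'Bangalore', 'Amritsar']

# Rows taken from the question; 'Faridabad' is not in Location.
sample = pd.DataFrame({'Type': ['Villa for sale in Amritsar',
                                'House in Faridabad',
                                '5 Hectre land near New Delhi']})

# \b anchors each name at word boundaries; '|' joins the alternatives.
pat = '|'.join(r"\b{}\b".format(x) for x in Location)

# One capture group with expand=False returns a Series;
# rows without a match become NaN.
sample['Location'] = sample['Type'].str.extract('(' + pat + ')', expand=False)
print(sample['Location'].tolist())
```

The second row has no listed city, so its Location is a missing value rather than 'Faridabad'.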

Then build a dictionary from the other lists, swap keys with values, and in a loop set each category through a boolean mask built with str.contains and the parameter case=False:

d = {'Apartment' : Apartment,
     'House' : House,
     'Plot' : Plot}

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

for k, v in d1.items():
    df.loc[df['Type'].str.contains(k, case=False), 'Type'] = v

print (df)
       id       Type  agent_id  created_at   Location
0   44525      House       184  2018-03-09  New Delhi
1   44859      House       182  2017-02-19   Amritsar
2   45465      House       154  2017-04-17        NaN
3   50685       Plot       113  2017-09-01  New Delhi
4  130728  Apartment       157  2017-02-07     Mumbai
5  130856       Plot       137  2018-01-16     Mumbai
6  130857  Apartment       199  2017-03-24  Bangalore
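
If the loop is not wanted, one possible alternative (not part of the answer above, just a sketch) is to extract the first matching keyword case-insensitively and map it to its category. The `sample` frame is illustrative. Note one difference in behaviour: rows without any keyword become NaN here, whereas the loop leaves their original text in place.

```python
import re
import pandas as pd

Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']

# Rows taken from the question.
sample = pd.DataFrame({'Type': ['Stunning 6 bedroom villa in New Delhi',
                                'Duplex in Mumbai',
                                'Large plot with fantastic views in Mumbai']})

d = {'Apartment': Apartment, 'House': House, 'Plot': Plot}
# Invert to keyword -> category, e.g. 'villa' -> 'House'.
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

# Case-insensitive alternation of all keywords; re.escape guards
# against keywords that contain regex special characters.
pat = '(' + '|'.join(re.escape(k) for k in d1) + ')'
keyword = sample['Type'].str.extract(pat, flags=re.IGNORECASE, expand=False)

# Lower-case the match so it lines up with the dictionary keys,
# then map each keyword to its category.
sample['Type'] = keyword.str.lower().map(d1)
print(sample['Type'].tolist())
```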

2 Comments

Thanks for the help. What happens if the keyword for 'Location' isn't in the list? Will it put NaN there? @jezrael
@astroluv - yes, exactly: if the value doesn't exist, a missing value is created. If necessary, the last step should be df['Location'] = df['Location'].fillna('not exist location') to replace NaN with a string.
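
A small sketch of that last step, using a made-up three-element Series in place of the real Location column:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df['Location'] after str.extract.
loc = pd.Series(['New Delhi', np.nan, 'Mumbai'], name='Location')

# fillna replaces only the missing entries and leaves matches untouched.
filled = loc.fillna('not exist location')
print(filled.tolist())
```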

    106     if isna(key).any():
--> 107         raise ValueError('cannot index with vector containing '
    108                          'NA / NaN values')
    109     return False

ValueError: cannot index with vector containing NA / NaN values

I got the above error.
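
One likely cause, assuming the Type column contains missing values: str.contains can then return NaN for those rows, and a mask with NaN in it cannot be used for indexing with .loc. Passing na=False is one way around it. This is a minimal sketch with a made-up `sample` frame, not a diagnosis of the actual data.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in 'Type'.
sample = pd.DataFrame({'Type': ['House in Faridabad', np.nan, 'Duplex in Mumbai']})

# na=False treats missing values as "no match", so the mask is
# strictly boolean and safe to use for indexing.
mask = sample['Type'].str.contains('house', case=False, na=False)
sample.loc[mask, 'Type'] = 'House'
print(sample['Type'].tolist())
```

The missing row stays missing; only rows that actually contain the keyword are overwritten.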

1 Comment

Hi Awani! If you're having trouble with the accepted answer, you can ask for more info in the comment section of that answer, or you could even ask a question directly on Stack Overflow
