Building on the approach in the previous answer, I first collected all the locations (cities, union territories, states, districts and territories). I then wrote a function, checkl(), that checks whether a location belongs to India or the USA and converts it to the country name. Finally, I applied the function to the dataframe column df['user_location']:
# Trying to get all the locations of USA and India
import pandas as pd
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_tables = pd.read_html(us_url)  # fetch and parse the page once
us_states = us_tables[0].iloc[:, 0].tolist()
us_cities = us_tables[0].iloc[:, 1].tolist() + us_tables[0].iloc[:, 2].tolist() + us_tables[0].iloc[:, 3].tolist()
us_Federal_district = us_tables[1].iloc[:, 0].tolist()
us_Inhabited_territories = us_tables[2].iloc[:, 0].tolist()
us_Uninhabited_territories = us_tables[3].iloc[:, 0].tolist()
us_Disputed_territories = us_tables[4].iloc[:, 0].tolist()
us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_tables = pd.read_html(in_url)  # fetch and parse the page once
in_states = in_tables[3].iloc[:, 0].tolist() + in_tables[3].iloc[:, 4].tolist() + in_tables[3].iloc[:, 5].tolist()
in_unions = in_tables[4].iloc[:, 0].tolist()
ind = in_states + in_unions
usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind])
# Country name checker function
def checkl(T):
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    res_ind = [ele for ele in ind if (ele in T)]
    res_us = [ele for ele in us if (ele in T)]
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or bool(res_ind) == True:
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or bool(res_us) == True:
        T = 'USA'
    elif len(T.split(',')) > 1:
        if T.split(',')[0] in indToStr or T.split(',')[1] in indToStr:
            T = 'India'
        elif T.split(',')[0] in usToStr or T.split(',')[1] in usToStr:
            T = 'USA'
        else:
            T = "Others"
    else:
        T = "Others"
    return T
# Applying the function to the dataframe column
print(df['user_location'].dropna().apply(checkl).value_counts())
Others    74206
USA       47840
India     20291
Name: user_location, dtype: int64
I am quite new to Python. I think this code could be written in a better, more compact form, and, as mentioned in the previous answer, there are still a lot of edge cases to deal with. So I have also posted it on Code Review Stack Exchange. Any criticism and suggestions to improve the efficiency and readability of my code would be greatly appreciated.
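One of the edge cases is that checkl() relies on substring tests: 'US' in T is also true for 'RUSSIA', and T.lower() in indToStr.lower() is true for any short fragment that happens to occur somewhere in the joined string. A possible direction for a more compact version is whole-word matching with a compiled regex. The following is only a rough sketch under assumptions: INDIA_PLACES and USA_PLACES are small hand-picked samples standing in for the scraped ind and us lists, and build_pattern() and classify_location() are hypothetical names, not part of the code above.

```python
import re

# Hypothetical, hand-picked samples (assumption); in practice these would be
# the scraped `ind` and `us` lists.
INDIA_PLACES = ['India', 'Hindustan', 'Bharat', 'New Delhi', 'Mumbai', 'Kerala']
USA_PLACES = ['USA', 'United States', 'New York', 'California', 'Texas', 'Indiana']


def build_pattern(names):
    """Compile one case-insensitive regex that matches any name as a whole word."""
    # Keep only non-empty strings (scraped tables may contain NaN cells).
    cleaned = {n.strip() for n in names if isinstance(n, str) and n.strip()}
    # Longest names first, so longer names win over shorter ones they contain.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternatives = '|'.join(re.escape(n) for n in ordered)
    # The lookarounds require a non-word character (or string edge) on both sides.
    return re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)', re.IGNORECASE)


INDIA_RE = build_pattern(INDIA_PLACES)
USA_RE = build_pattern(USA_PLACES)
# The bare abbreviation is matched case-sensitively, so the word "us" is ignored.
US_ABBREV_RE = re.compile(r'\bUS\b')


def classify_location(text):
    """Return 'India', 'USA' or 'Others' for a free-text location."""
    if not isinstance(text, str):
        return 'Others'
    if INDIA_RE.search(text):
        return 'India'
    if USA_RE.search(text) or US_ABBREV_RE.search(text):
        return 'USA'
    return 'Others'


for sample in ['Mumbai, India', 'RUSSIA', 'Austin, TX, US', 'Indiana, USA']:
    print(sample, '->', classify_location(sample))
```

On the dataframe this would be used the same way as checkl(), for example df['user_location'].apply(classify_location).value_counts(). It does not resolve ambiguous place names; it only avoids the accidental substring hits.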
df['user_location'].value_counts()[['United States', 'India']] & df['user_location'].value_counts()[['United States', 'India']].plot.bar().
India, USA and also in a different way, some has USA, some as the United States!
df['user_location'] = df['user_location'].map({'USA': 'United States'}). pandas.Series.map
India, New Delhi, India, Mumbai, India, ... in a single name. Basically I want to show the frequencies country-wise, not state-wise.
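A note on the map call quoted above: when pandas.Series.map is given a dict, every value that is not a key in the dict becomes NaN, so mapping only {'USA': 'United States'} would blank out 'India' and everything else. Series.replace leaves unmatched values as they are. A minimal illustration on made-up data (the locations Series here is an assumption, not the real column):

```python
import pandas as pd

# Made-up sample standing in for df['user_location'] (assumption).
locations = pd.Series(['USA', 'United States', 'India', 'USA'])

# map with a dict: values missing from the dict turn into NaN.
mapped = locations.map({'USA': 'United States'})

# replace with a dict: only the listed values change, the rest are kept.
replaced = locations.replace({'USA': 'United States'})

print(mapped.isna().sum())
print(replaced.value_counts())
```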