I have a dataframe df such that:

df['user_location'].value_counts()
India                           3741
United States                   2455
New Delhi, India                1721
Mumbai, India                   1401
Washington, DC                  1354
                                ... 
SpaceCoast,Florida                 1
stuck in a book.                   1
Beirut , Lebanon                   1
Royston Vasey - Tralfamadore       1
Langham, Colchester                1
Name: user_location, Length: 26920, dtype: int64

I want to know the frequency of specific countries, such as the USA and India, in the user_location column, and then plot the frequencies as USA, India, and Others. So I want to apply some operation to that column such that value_counts() gives output like:

India     (sum of all frequencies of all the locations in India including cities, states, etc.)
USA       (sum of all frequencies of all the locations in the USA including cities, states, etc.)
Others    (sum of all frequencies of the other locations)                    

It seems I should merge the frequencies of rows that contain the same country name and merge the rest together, but this looks complex once the names of cities, states, etc. are involved. What is the most efficient way to do it?

  • df['user_location'].value_counts()[['United States', 'India']] and df['user_location'].value_counts()[['United States', 'India']].plot.bar(). Commented Aug 30, 2020 at 4:08
  • If you look closely, the dataframe contains many other rows that include the names India and USA, and in different forms: some say USA, some say United States. Commented Aug 30, 2020 at 4:14
  • You may want to map alternate names to a single name (e.g. df['user_location'] = df['user_location'].map({'USA': 'United States'})). pandas.Series.map Commented Aug 30, 2020 at 4:17
  • Yes, but not only the alternate names: I also want to combine states and cities, so that the frequencies of India, New Delhi, India, Mumbai, India, ... appear under a single name. Basically, I want the frequencies country-wise, not state-wise. Commented Aug 30, 2020 at 4:25
  • Any criticisms and suggestions to improve the efficiency and readability of my solution to this issue would be greatly appreciated: codereview.stackexchange.com/q/248918/230104 Commented Sep 6, 2020 at 4:11

2 Answers

Adding to @Trenton_McKinney's suggestion in the comments: if you need to map each country's states or provinces to the country name, you will have to do a little work to build those associations. For example, for India and the USA, you can grab a list of their states from Wikipedia and map them onto your own data to relabel them with their respective country names, as follows:

import pandas as pd

# Get states of India and USA.
# Note: the table indices ([3] and [0]) depend on the layout of each Wikipedia page.
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
states = in_states + us_states

# Make a sample dataframe
df = pd.DataFrame({'Country': states})

    Country
0   Andhra Pradesh
1   Arunachal Pradesh
2   Assam
3   Bihar
4   Chhattisgarh
... ...
73  Virginia[E]
74  Washington
75  West Virginia
76  Wisconsin
77  Wyoming

Map state names to country names:

# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)

    Country
0   India
1   India
2   India
3   India
4   India
... ...
73  USA
74  USA
75  USA
76  USA
77  USA
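
Two details of this mapping are worth a small sketch. Scraped names can carry footnote markers (such as "Virginia[E]" above), and Series.map returns NaN for any value that is not a key in the dictionary. The example below uses hypothetical stand-in lists instead of the scraped ones, and clean() is an illustrative helper name, not part of the answer's code:

```python
import re
import pandas as pd

# Hypothetical stand-ins for the scraped lists; "Virginia[E]" mimics a
# name that still carries a Wikipedia footnote marker.
in_states = ["Assam", "Bihar"]
us_states = ["Virginia[E]", "Washington"]

def clean(name):
    """Strip bracketed footnote markers such as "[E]" (illustrative helper)."""
    return re.sub(r"\[[^\]]*\]", "", str(name)).strip()

states_dict = {clean(s): "India" for s in in_states}
states_dict.update({clean(s): "USA" for s in us_states})

locations = pd.Series(["Bihar", "Virginia", "Lebanon"])
# map() yields NaN for values missing from the dict;
# fillna() turns those into the catch-all label "Others".
countries = locations.map(states_dict).fillna("Others")
print(countries.value_counts())
```

Without the fillna step, unmapped locations would be dropped from value_counts(), so there would be no "Others" bucket to plot.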

But from your data sample it looks like you will have a lot of edge cases to deal with as well.
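
As a rough starting point for those edge cases, one option is to split each location on commas, normalise case and whitespace, and look up each part in a table of known names. The sketch below assumes a small hand-written lookup (PLACE_TO_COUNTRY and to_country are hypothetical names); in practice the table would be built from states_dict plus country aliases:

```python
from collections import Counter

# Hypothetical lookup table: lower-cased place names and aliases -> country.
PLACE_TO_COUNTRY = {
    "india": "India",
    "new delhi": "India",
    "mumbai": "India",
    "united states": "USA",
    "usa": "USA",
    "washington": "USA",
    "dc": "USA",
    "florida": "USA",
}

def to_country(location, lookup=PLACE_TO_COUNTRY):
    """Return the country for a free-text location, or 'Others'."""
    # Split "City, Country" on commas and normalise case and whitespace.
    parts = [p.strip().lower() for p in str(location).split(",")]
    # Check the last part first, since it is usually the country or state.
    for part in reversed(parts):
        if part in lookup:
            return lookup[part]
    return "Others"

sample = [
    "India",
    "New Delhi, India",
    "Washington, DC",
    "SpaceCoast,Florida",
    "Beirut , Lebanon",
    "stuck in a book.",
]
counts = Counter(to_country(loc) for loc in sample)
print(counts)
```

On a dataframe, the same function could be applied with df['user_location'].dropna().map(to_country).value_counts(). Exact-match lookup misses free-form entries, so the table has to grow as new variants turn up.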



Building on the previous answer, I first tried to collect all the locations, including cities, union territories, states, districts, and territories. Then I wrote a function checkl() that checks whether a location is in India or the USA and converts it to the country name. Finally, I applied the function to the dataframe column df['user_location']:

# Collect place names for the USA and India from Wikipedia

import pandas as pd

us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_tables = pd.read_html(us_url)  # fetch and parse the page once
us_states = us_tables[0].iloc[:, 0].tolist()
us_cities = us_tables[0].iloc[:, 1].tolist() + us_tables[0].iloc[:, 2].tolist() + us_tables[0].iloc[:, 3].tolist()
us_Federal_district = us_tables[1].iloc[:, 0].tolist()
us_Inhabited_territories = us_tables[2].iloc[:, 0].tolist()
us_Uninhabited_territories = us_tables[3].iloc[:, 0].tolist()
us_Disputed_territories = us_tables[4].iloc[:, 0].tolist()

us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories

in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_tables = pd.read_html(in_url)  # fetch and parse the page once
in_states = in_tables[3].iloc[:, 0].tolist() + in_tables[3].iloc[:, 4].tolist() + in_tables[3].iloc[:, 5].tolist()
in_unions = in_tables[4].iloc[:, 0].tolist()
ind = in_states + in_unions

# Space-joined strings used for the substring checks in checkl()
usToStr = ' '.join(str(elem) for elem in us)
indToStr = ' '.join(str(elem) for elem in ind)


# Country name checker function

def checkl(T):
    """Return 'India', 'USA', or 'Others' for a free-text location string T."""
    # Lower-cased tokens obtained by splitting on whitespace and on commas
    TSplit = {x.lower().strip() for x in T.split()} | {x.lower().strip() for x in T.split(',')}
    parts = T.split(',')

    if (TSplit & {'india', 'hindustan', 'bharat'}
            or T.lower() in indToStr.lower()
            or any(ele in T for ele in ind)):
        return 'India'
    # 'US' in T also covers 'USA'; note that it is a plain substring test
    if ('US' in T or 'United States' in T
            or TSplit & {'usa', 'united state'}
            or T.lower() in usToStr.lower()
            or any(ele in T for ele in us)):
        return 'USA'
    # Fall back to checking the first two comma-separated parts
    if len(parts) > 1:
        if parts[0] in indToStr or parts[1] in indToStr:
            return 'India'
        if parts[0] in usToStr or parts[1] in usToStr:
            return 'USA'
    return 'Others'

# Applying the function to the dataframe column

print(df['user_location'].dropna().apply(checkl).value_counts())
Others    74206
USA       47840
India     20291
Name: user_location, dtype: int64

I am quite new to Python. I think this code could be written in a better and more compact form, and, as mentioned in the previous answer, there are still a lot of edge cases to deal with. So I have also posted it on Code Review Stack Exchange. Any criticisms and suggestions to improve the efficiency and readability of my code would be greatly appreciated.
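
One source of false positives in plain substring checks such as 'US' in T is that they also match inside longer words: 'US' in 'RUSSIA' is True. A minimal sketch of an alternative, assuming whole-word matching with the standard re module is acceptable; the pattern names and classify() are illustrative and cover only country names and abbreviations, not states or cities:

```python
import re

# Whole-word, case-insensitive names for India.
INDIA_RE = re.compile(r"\b(india|hindustan|bharat)\b", re.IGNORECASE)
# "US" / "USA" as whole words, case-sensitive so the pronoun "us" is ignored.
US_ABBR_RE = re.compile(r"\bUSA?\b")
# Full country name, case-insensitive.
US_NAME_RE = re.compile(r"\bunited states\b", re.IGNORECASE)

def classify(location):
    """Return 'India', 'USA', or 'Others' using whole-word matches only."""
    text = str(location)
    if INDIA_RE.search(text):
        return "India"
    if US_ABBR_RE.search(text) or US_NAME_RE.search(text):
        return "USA"
    return "Others"

for loc in ["Mumbai, India", "Austin, TX USA", "RUSSIA", "come join us"]:
    print(loc, "->", classify(loc))
```

This could be applied with df['user_location'].dropna().map(classify).value_counts(), and the resulting counts plotted with .plot.bar(). State and city names would still need a lookup table on top of these patterns.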
