1

I have this frame:

regions = pd.read_html('http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html')
messy_regions = regions[8]

Which yields something like this:

    |0  |      1
--- |---| ---
0|  Region 1 (The Northeast)|   nan
1|  Division 1 (New England)|   Division 2 (Middle Atlantic)
2|  Maine                      | New York
3|  New Hampshire              | Pennsylvania
4|  Vermont                    | New Jersey
5|  Massachusetts               |nan
6|  Rhode Island                |nan
7|  Connecticut                | nan
8|  Region 2 (The Midwest)    |  nan
9|  Division 3 (East North Central)|    Division 4 (West North Central)
10| Wisconsin     |             North Dakota
11| Michigan     |              South Dakota
12| Illinois    |               Nebraska

The goal is to make this a tidy dataframe and I think I need to pivot in order to get the regions and Divisions as columns with the states as values under the correct regions/divisions. Once it's in that shape then I can just melt to the desired shape. I can't figure out though how to extract what would be the column headers out of this though. Any help is appreciated and at the very least a good point in the right direction.

1 Answer 1

1

You can use:

url = 'http://www.mapsofworld.com/usa/usa-maps/united-states-regional-maps.html'
#input dataframe with columns a, b 
df = pd.read_html(url)[8]
df.columns = ['a','b']

#extract Region data to new column
df['Region'] = df['a'].where(df['a'].str.contains('Region', na=False)).ffill()
#reshaping, remove rows with NaNs, remove column variable
df = pd.melt(df, id_vars='Region', value_name='Names')
       .sort_values(['Region', 'variable'])
       .dropna()
       .drop('variable', axis=1)
#extract Division data to new column
df['Division'] = df['Names'].where(df['Names'].str.contains('Division', na=False)).ffill()
#remove duplicates from column Names, change order of columns
df = df[(df.Division != df.Names) & (df.Region != df.Names)]
      .reset_index(drop=False)
      .reindex_axis(['Region','Division','Names'], axis=1)
#temporaly display all columns
with pd.option_context('display.expand_frame_repr', False):
    print (df)

                      Region                         Division                 Names
0   Region 1 (The Northeast)         Division 1 (New England)                 Maine
1   Region 1 (The Northeast)         Division 1 (New England)         New Hampshire
2   Region 1 (The Northeast)         Division 1 (New England)               Vermont
3   Region 1 (The Northeast)         Division 1 (New England)         Massachusetts
4   Region 1 (The Northeast)         Division 1 (New England)          Rhode Island
5   Region 1 (The Northeast)         Division 1 (New England)           Connecticut
6   Region 1 (The Northeast)     Division 2 (Middle Atlantic)              New York
7   Region 1 (The Northeast)     Division 2 (Middle Atlantic)          Pennsylvania
8   Region 1 (The Northeast)     Division 2 (Middle Atlantic)            New Jersey
9     Region 2 (The Midwest)  Division 3 (East North Central)             Wisconsin
10    Region 2 (The Midwest)  Division 3 (East North Central)              Michigan
11    Region 2 (The Midwest)  Division 3 (East North Central)              Illinois
12    Region 2 (The Midwest)  Division 3 (East North Central)               Indiana
13    Region 2 (The Midwest)  Division 3 (East North Central)                  Ohio
...
...
Sign up to request clarification or add additional context in comments.

2 Comments

Awesome! You made this look so easy. I'll be stepping through this to better understand each step.
Glad can help you. Also if need values in parentheses maybe help this answers. Nice day.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.