0

I am working on a project of analyzing weather data. Below is a abbreviated version of my csv file (only focus on the last column "Conditions"):

Year,Month,Day,Hour,DOW,Maximum Temperature,Minimum Temperature,Temperature,Precipitation,Snow,SnowDepth,Wind Speed,Visibility,Cloud Cover,Relative Humidity,Conditions
2020,3,5,8,3,48.0,48.0,48.0,0.0,0.0,0.0,10.3,9.9,0.0,81.44,Clear
2020,3,5,10,3,56.9,56.9,56.9,0.0,0.0,0.0,6.3,9.9,25.1,55.29,Partially cloudy
2020,3,9,8,0,60.7,60.7,60.7,0.0,0.0,0.0,14.5,8.1,79.6,91.95,Overcast
2020,3,9,10,0,62.5,62.5,62.5,0.01,0.0,0.0,16.0,7.0,94.7,89.95,"Rain, Overcast"
2020,3,17,20,1,66.4,66.4,66.4,0.02,0.0,0.0,8.7,4.3,68.6,88.78,"Rain, Partially cloudy"

and I want to transfer it to something like this:

Clear,Partially cloudy,Rain,Overcast
1,0,0,0
0,1,0,0
0,0,0,1
0,0,1,1
0,1,1,0

I saw that I could use the code below but I don't know how to deal with the condition when I have 2 categories in one data.

dataset['Conditions'] = dataset['Conditions'].map({1: 'Clear', 2: 'Partially cloudy', 3: 'Rain', 4: 'Snow'})
dataset = pd.get_dummies(dataset, columns=['Conditions'], prefix='', prefix_sep='')

Thank you in advance : )

3 Answers 3

3

Try str.split + explode then sum level 0:

dummies = pd.get_dummies(
    dataset['Conditions'].str.split(', ').explode()
).sum(level=0)

print(dummies)

dummies:

   Clear  Overcast  Partially cloudy  Rain
0      1         0                 0     0
1      0         0                 1     0
2      0         1                 0     0
3      0         1                 0     1
4      0         0                 1     1

To join back to the original DataFrame:

dummies = pd.get_dummies(
    dataset['Conditions'].str.split(', ').explode()
).sum(level=0)
# Join Back to dataset
dataset = dataset.drop(columns='Conditions').join(dummies)
print(dataset.to_string())
   Year  Month  Day  Hour  ...  Clear  Overcast  Partially cloudy  Rain
0  2020      3    5     8  ...      1         0                 0     0
1  2020      3    5    10  ...      0         0                 1     0
2  2020      3    9     8  ...      0         1                 0     0
3  2020      3    9    10  ...      0         1                 0     1
4  2020      3   17    20  ...      0         0                 1     1
Sign up to request clarification or add additional context in comments.

4 Comments

sorry if this sounds kinda dumb cuz i'm new to all this. How can I merge "out" into the original csv file?
Instead of or in addition to the conditions column?
replace the "Conditions" column as the 4 columns
It's join but I added the complete code to my answer.
3

you can use pd.get_dummies:

result = (
    pd.get_dummies(
        df.Conditions.str.split(', ', expand=True)
        .stack())
    .sum(level=0)
)

OUTPUT:

   Clear  Overcast  Partially cloudy  Rain
0      1         0                 0     0
1      0         0                 1     0
2      0         1                 0     0
3      0         1                 0     1
4      0         0                 1     1

Comments

0
import pandas as pd 
xx = pd.DataFrame([[1,2,"ss"],[2,3,"cc"],[4,2,"d"]],columns=["v1","v2","s"])
pd.Series(xx["s"]).str.get_dummies()

1 Comment

How does this handle the case when there are multiple categorical data points in the same column "Rain, Partially cloudy" for example?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.