
I am trying to work out why the column names for pandas.concat() are in brackets.

There is a similar question here, but in my context I don't understand how this can be happening. It is as if there were a double bracket in the assignment, but given the concatenated dataframe looks fine I cannot understand what is causing it.

The output is below the code.

import warnings
import random
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra
from sklearn.preprocessing import OneHotEncoder
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass4/forestfires.csv'
full_df = pd.read_csv(url)
print(f"{full_df.head()}\n")

ohe = OneHotEncoder(handle_unknown='ignore', drop=None, dtype='int')

transformed = ohe.fit_transform(full_df[['month']])
month_df = pd.DataFrame(transformed.toarray())
month_df.columns = ohe.categories_

print(month_df.head())

full_df = full_df.drop(['month'], axis=1)

result = pd.concat([full_df, month_df], axis=1)
result.head()

The full output is:

   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4   oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6   mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6   mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0

  apr aug dec feb jan jul jun mar may nov oct sep
0   0   0   0   0   0   0   0   1   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   1   0
2   0   0   0   0   0   0   0   0   0   0   1   0
3   0   0   0   0   0   0   0   1   0   0   0   0
4   0   0   0   0   0   0   0   1   0   0   0   0
   X  Y  day  FFMC   DMC     DC  ISI  temp  RH  wind  ...  (dec,)  (feb,)  (jan,)  (jul,)  (jun,)  (mar,)  (may,)  (nov,)  (oct,)  (sep,)
0  7  5  fri  86.2  26.2   94.3  5.1   8.2  51   6.7  ...       0       0       0       0       0       1       0       0       0       0
1  7  4  tue  90.6  35.4  669.1  6.7  18.0  33   0.9  ...       0       0       0       0       0       0       0       0       1       0
2  7  4  sat  90.6  43.7  686.9  6.7  14.6  33   1.3  ...       0       0       0       0       0       0       0       0       1       0
3  8  6  fri  91.7  33.3   77.5  9.0   8.3  97   4.0  ...       0       0       0       0       0       1       0       0       0       0
4  8  6  sun  89.3  51.3  102.2  9.6  11.4  99   1.8  ...       0       0       0       0       0       1       0       0       0       0

5 rows × 24 columns

1 Answer

ohe.categories_ is a list of arrays, one array per encoded feature. When you assign the whole list as the column names, each name becomes a one-element tuple. Change this line:

month_df.columns = ohe.categories_

to:

month_df.columns = ohe.categories_[0]
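
To see the fix in isolation, here is a minimal sketch that needs neither scikit-learn nor the CSV download. The `categories` list is a hand-built stand-in for `ohe.categories_` (an assumption: a list holding a single array of labels), and the two small frames are made-up data rather than the forest fires dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ohe.categories_: a list holding one array of labels.
categories = [np.array(["mar", "oct"], dtype=object)]

# Made-up data standing in for full_df and the one-hot encoded months.
full_df = pd.DataFrame({"X": [7, 7], "Y": [5, 4]})
month_df = pd.DataFrame([[1, 0], [0, 1]])

# Take the first (and only) array so the columns become a flat Index of strings.
month_df.columns = categories[0]

result = pd.concat([full_df, month_df], axis=1)
print(result.columns.tolist())
```

With the `[0]` in place, the concatenated column names should come out as plain strings rather than tuples.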

2 Comments

Perfect. Thank you. But why didn't it show up in the original month_df?
Apparently it was concat that manipulated the column names.
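
A likely explanation for why month_df looked fine on its own: assigning a list of arrays to `.columns` makes pandas build a MultiIndex, here with a single level. A one-level MultiIndex prints its labels without brackets, but each label is still a one-element tuple. The sketch below checks this, again using a hand-built `categories` list as an assumed stand-in for `ohe.categories_`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ohe.categories_: a list holding one array of labels.
categories = [np.array(["mar", "oct"], dtype=object)]

month_df = pd.DataFrame([[1, 0], [0, 1]])
month_df.columns = categories  # the whole list, as in the question

print(type(month_df.columns).__name__)  # kind of index pandas built
print(month_df.columns.nlevels)         # number of levels in that index
print(month_df.columns.tolist())        # the labels as Python objects
```

When concat then joins these columns with the flat column index of full_df, the month labels seem to be carried over as those tuples, which matches the (mar,) style names in the question's output. So concat probably exposed the tuples rather than created them.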
