2

Scenario: I have a pandas dataframe. I am trying to use the values in a given column (year) to find the relevant header name and add it to a new column (year_name). For example, if the dataframe looks like this:

itemName 2020 2021 2022 2023 2024 year
item1 5 20 10 10 50 3
item2 10 10 50 20 40 2
item3 12 35 73 10 54 4

The result should be like this:

itemName 2020 2021 2022 2023 2024 year year_name
item1 5 20 10 10 50 3 2022
item2 10 10 50 20 40 2 2021
item3 12 35 73 10 54 4 2023

Obs. the itemName column is the index.

Issue: I am trying to use a lambda function to use the value of each row of "year" and use it to find the column name for that row and add it to the year_name column.

Function: I tried:

col_names = result_dict[col].columns.tolist()
result_df[[last_year_header']] = result_df[[_last_year']].apply(lambda x: col_names[x])

but this gave me the following error:

 TypeError: list indices must be integers or slices, not Series

I also tried:

col_names = result_dict[col].columns.tolist()
result_df[[last_year_header']] = result_df[[_last_year']].apply(lambda x: col_names[x.iloc[0].astype(int)])

But this gave me:

 IndexError: list index out of range

Question: I am clearly missing something with the implementation of the lambda function in this case. How can I fix this?

1 Answer 1

3

You don't need a lmabda, you should be able to directly index your columns index:

df['year_name'] = df.columns[df['year']-1] #.astype('Int64')

Output:

          2020  2021  2022  2023  2024  year year_name
itemName                                              
item1        5    20    10    10    50     3      2022
item2       10    10    50    20    40     2      2021
item3       12    35    73    10    54     4      2023

If there can be invalid values in df['year'], you could use a Series with reindex:

df['year_name'] = pd.Series(df.columns).reindex(df['year']-1).values

Example output:

          2020  2021  2022  2023  2024  year year_name
itemName                                              
item1        5    20    10    10    50   3.0      2022
item2       10    10    50    20    40  20.0       NaN
item3       12    35    73    10    54   NaN       NaN

Reproducible inputs:

# Example 1 (valid values)
df = pd.DataFrame.from_dict({
        'index': ['item1', 'item2', 'item3'],
        'columns': [2020, 2021, 2022, 2023, 2024, 'year'],
        'data': [[5, 20, 10, 10, 50, 3],
                 [10, 10, 50, 20, 40, 2],
                 [12, 35, 73, 10, 54, 4],],
        'index_names': ['itemName'],
        'column_names': [None],
    }, 'tight')

# Example 2 (invalid values)
df = pd.DataFrame.from_dict({
        'index': ['item1', 'item2', 'item3'],
        'columns': [2020, 2021, 2022, 2023, 2024, 'year'],
        'data': [[5, 20, 10, 10, 50, 3],
                 [10, 10, 50, 20, 40, 20],
                 [12, 35, 73, 10, 54, None],],
        'index_names': ['itemName'],
        'column_names': [None],
    }, 'tight')
Sign up to request clarification or add additional context in comments.

4 Comments

many thanks for the answer. For some reason, both give errors. First one gives: "IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices" and second one gives: "ValueError: Columns must be same length as key". Could this be related to the fact that some rows might have np.nan in the year value?
Can you provide a minimal reproducible example of your input that triggers the error? I'm surprised that the reindex approach fails.
Can you check the output of df['year'].unique()?
I found out the mistake. On the assignment, I had a double bracket set up. I had: df[['year_name']] instead of df['year_name']. Just fixed it now. many thanks.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.