0

I am trying to add multiple columns to a dataframe with numpy.where() in an ETL logic.

This is my df:

enter image description here

I am trying to get my df as:

enter image description here

And the code is:

current_time = pd.Timestamp.utcnow().strftime('%Y-%m-%d %H:%M:%S')

df = pd.concat(
    [
        df,
        pd.DataFrame(
            [
                np.where(
                    # When old hash code is available and new hash code is not available. 0 -- N
                    (
                            df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                    ) |
                    # When hash codes are available and matched. 3.1 -- 'N'
                    (
                            ~df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                            &
                            ~(df['new_hash'].ne(df['old_hash']))
                    ),
                    ['N', df['cr_date'], df['up_date']],
                    np.where(
                        # When new hash code is available and old hash code is not available. 1 -- Y
                        (
                                ~df['new_hash'].isna()
                                &
                                df['old_hash'].isna()
                        ),
                        ['Y', current_time, current_time],
                        np.where(
                            # When hash codes are available and matched. 3.2 -- 'Y'
                            (
                                    ~df['new_hash'].isna()
                                    &
                                    ~df['old_hash'].isna()
                                    &
                                    df['new_hash'].ne(df['old_hash'])
                            ),
                            ['Y', df['cr_date'], current_time],
                            ['N', df['cr_date'], df['up_date']]
                        )
                    )
                )
            ],
            index=df.index,
            columns=['is_changed', 'cr_date_new', 'up_date_new']
        )
    ],
    axis=1
)

Tried above code with df.join() instead of pd.concat(). Still giving me below specified ValueError

I am able add one column at a time. and the example is:

df['is_changed'] = (
    np.where(
        # When old hash code is available and new hash code is not available. 0 -- N
        (
                df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
        ) |
        # When hash codes are available and matched. 3.1 -- 'N'
        (
                ~df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
                &
                ~(df['new_hash'].ne(df['old_hash']))
        ),
        'N',
        np.where(
            # When new hash code is available and old hash code is not available. 1 -- Y
            (
                    ~df['new_hash'].isna()
                    &
                    df['old_hash'].isna()
            ),
            'Y',
            np.where(
                # When hash codes are available and matched. 3.2 -- 'Y'
                (
                        ~df['new_hash'].isna()
                        &
                        ~df['old_hash'].isna()
                        &
                        df['new_hash'].ne(df['old_hash'])
                ),
                'Y',
                'N'
            )
        )
    )
)

But getting error (ValueError: operands could not be broadcast together with shapes (66,) (3,) (3,)) with multiple columns.

what is the wrong with adding multiple columns? Can someone help me in this?

3
  • 5
    That code is almost totally illegible to me. It's just so far removed from regular python formatting Commented Dec 28, 2019 at 0:37
  • 2
    I also don't follow how this is a join. You haven't (I don't think) shown df, so I don;t know what concat is working on. This is not a minimal reproducible example and I strongly suggest that you change the formatting style Commented Dec 28, 2019 at 0:47
  • @roganjosh hope my changes will be understandable now Commented Dec 28, 2019 at 1:41

1 Answer 1

1

In np.where(cond,A,B) Python evaluates each of cond, A and B, and then passes them to the where function. where then broadcasts the inputs against each other, and performs the element-wise selection. You appear to have 3 nested where. I'm guessing the error occurs in the inner most one, since it will be evaluated first (I wouldn't have to guess if you provided the error traceback.)

                    np.where(
                        # When hash codes are available and matched. 3.2 -- 'Y'
                        (
                                ~df['new_hash'].isna()
                                &
                                ~df['old_hash'].isna()
                                &
                                df['new_hash'].ne(df['old_hash'])
                        ),
                        ['Y', df['cr_date'], current_time],
                        ['N', df['cr_date'], df['up_date']]
                    )

The cond part the first () logical_and expression.

The A is the 3 element list, and B the next list.

Assuming there are 66 rows, the cond will have (66,) shape.

np.array(['Y', df['cr_date'], current_time]) is probably a (3,) shape object dtype array, since the inputs consist on a string, a Series, and a time object.

That accounts for the 3 shapes in the error message: shapes (66,) (3,) (3,))

If you try to set only one column at a time, the expression would be np.where(cond, 'Y', 'N'), or np.where(cond, Series1, Series2).

If you don't understand what I (or the error) mean by broadcasting, you may need to learn more about numpy (which underlies pandas).

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.