
A dataframe has two columns. One has a single integer per row. The other has a string of multiple integers, separated by ',', per row:

import pandas as pd
duck_ids = ["1, 4, 5, 7", "3, 11, 14, 27"]
ducks_of_interest = [4,15]
duck_df = pd.DataFrame(
    {
        "DucksOfInterests": ducks_of_interest,
        "DuckIDs": duck_ids
    }
)
print(f"The starting dataframe:\n{duck_df}")


   DucksOfInterests        DuckIDs
0                 4     1, 4, 5, 7
1                15  3, 11, 14, 27

A new column is required that is True when the Duck of Interest is within the set of Duck IDs. This is attempted using a simple lambda function with the .apply method:

duck_df['DoIinDIDs'] = duck_df.apply(lambda x: str(x['DuckIDs']) in [x['DucksOfInterests']], axis=1)

This was expected to return a True for the first row, as 4 is a number in "1, 4, 5, 7", and False for the second row. However, the result is False for both rows:

print(f"The dataframe with the additional column:\n{duck_df}")

   DucksOfInterests        DuckIDs  DoIinDIDs
0                 4     1, 4, 5, 7      False
1                15  3, 11, 14, 27      False

What is the error in the code or the approach?

2 Answers

You were almost there, but you wrapped the value in an unnecessary list and swapped the two column names:

duck_df['DoIinDIDs'] = duck_df.apply(lambda x: str(x['DucksOfInterests'])
                                     in x['DuckIDs'], axis=1)

Output:

   DucksOfInterests        DuckIDs  DoIinDIDs
0                 4     1, 4, 5, 7       True
1                15  3, 11, 14, 27      False

Note, however, that this approach can give false positives because it tests for a substring of the whole string: 4 would also be found in "1, 14, 20".

You can instead split the string:

duck_df['DoIinDIDs'] = duck_df.apply(lambda x: str(x['DucksOfInterests'])
                                     in x['DuckIDs'].split(', '), axis=1)
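
As a quick check of the difference, here is a minimal sketch using plain strings rather than the dataframe. It reuses the "1, 14, 20" example from above; the variable names are only illustrative.

```python
# The ID string from the false-positive example above.
ids = "1, 14, 20"

# Substring test on the whole string: "4" is found inside "14".
substring_hit = str(4) in ids

# Membership test on the split tokens: compares "4" against "1", "14", "20".
token_hit = str(4) in ids.split(", ")

print(substring_hit, token_hit)
```

The substring test reports a match that is not really there, while the split version only matches whole IDs.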

Finally, because apply with axis=1 is slow, you can replace the whole thing with a list comprehension:

duck_df['DoIinDIDs'] = [str(a) in b.split(', ')
                        for a, b in zip(duck_df['DucksOfInterests'],
                                        duck_df['DuckIDs'])]

2 Comments

I hadn't realised the error you noted, thank you for pointing this out and explaining the use of split to avoid this issue.
You're welcome. Also, if you don't have a reliable separator, you could use a regex instead: str(a) in re.findall(r'\d+', b). Or even bool(re.search(fr'\b{a}\b', b)) in place of str(a) in b.split(', ').
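
The two regex variants from the comment above could be wrapped as small helpers. This is only a sketch: the function names in_ids_findall and in_ids_boundary are hypothetical, and it assumes the IDs are non-negative integers.

```python
import re

def in_ids_findall(target, ids):
    # Extract every run of digits, then compare whole tokens.
    return str(target) in re.findall(r"\d+", ids)

def in_ids_boundary(target, ids):
    # Search for the number delimited by word boundaries on both sides.
    return bool(re.search(rf"\b{target}\b", ids))

# Works even when the separator is inconsistent.
print(in_ids_findall(4, "1;4 ,5"))
print(in_ids_boundary(4, "1, 14, 20"))
```

Both variants compare whole numbers, so 4 is not matched inside 14, and neither depends on the exact ', ' separator.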

There are two issues. First, the order of DucksOfInterests and DuckIDs needs to be swapped. Second, you should convert the string to a list of ints rather than the int to a string, because a substring test such as "4" in "3, 11, 14, 27" returns True (it matches inside "14"):

duck_df['DoIinDIDs'] = duck_df.apply(lambda x: x['DucksOfInterests'] in map(int, x['DuckIDs'].split(',')), axis=1)
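
The integer comparison can also be tried outside the dataframe. The helper below is a hypothetical sketch (in_ids_as_ints is not part of pandas) and assumes every comma-separated token is a valid integer; int() tolerates the surrounding spaces, but an empty or non-numeric token would raise ValueError.

```python
def in_ids_as_ints(target, ids):
    # Split on commas and convert each token to int; int(" 4") == 4,
    # so the space after each comma does not matter.
    return target in {int(part) for part in ids.split(",")}

# The two rows from the question, plus the substring trap "4" vs "14".
print(in_ids_as_ints(4, "1, 4, 5, 7"))
print(in_ids_as_ints(15, "3, 11, 14, 27"))
print(in_ids_as_ints(4, "3, 11, 14, 27"))
```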
