0

Sorry if the title is unclear - I wasn't too sure how to word it. So I have a dataframe that has two columns for old IDs and new IDs.

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

I'm trying to figure out a way to check the string length of every value in both columns and return any IDs that don't match the required string length of 4 in a new dataframe. This will eventually turn into a dictionary of incorrect IDs.

This is the approach I'm currently taking:

incorrect_id_df = df[df.applymap(lambda x: len(x) != 4)]

and the current output:

old_id new_id
 111    NaN
 NaN    NaN
 NaN    777
 NaN    NaN

I'm not sure where to go from here, and I'm sure there's a much better approach. This is the output I'm looking for: a single-column dataframe, with the column named id, containing just the IDs that don't match the required string length:

 id
 111
 777
    Can you just .stack() that dataframe and use its .values attribute as an invalid list... but then you'd still have a reference to what column it was found in? Commented Jun 29, 2022 at 20:19

4 Answers

2

In general, DataFrame.applymap is pretty slow, so you should avoid it. I would stack both columns into a single one, and select the IDs whose length is not 4:

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

ids = df.stack()
bad_ids = ids[ids.str.len() != 4]

Output:

>>> bad_ids

0  old_id    111
2  new_id    777
dtype: object

The advantage of this approach is that you keep the location (row and column) of each bad ID, which might be useful later. If you don't need it, you can just use ids = df.stack().reset_index(drop=True).
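The question mentions eventually wanting a dictionary of incorrect IDs as well as a single-column dataframe named id. A minimal sketch of both, building on the stacked Series above (the names bad_by_location and bad_df are only illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})

# One long Series indexed by (row label, column name)
ids = df.stack()
bad_ids = ids[ids.str.len() != 4]

# Keep the location: dictionary keys are (row label, column name) tuples
bad_by_location = bad_ids.to_dict()

# Drop the location: single-column DataFrame with the column named 'id'
bad_df = bad_ids.reset_index(drop=True).to_frame('id')
```

Which of the two is more useful probably depends on whether you need to trace each bad ID back to its row later on.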


0

Here's part of an answer:

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
all_ids = df.values.flatten()  # 1-D array of every ID, row by row
bad_ids = [bad_id for bad_id in all_ids if len(bad_id) != 4]
bad_ids
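To get from that list to the single-column dataframe the question asks for, one option is to wrap it in a new DataFrame. This is only a sketch of the remaining step; the name incorrect_id_df is borrowed from the question:

```python
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})

# Flatten all IDs into one array, then keep those whose length is not 4
all_ids = df.values.flatten()
bad_ids = [bad_id for bad_id in all_ids if len(bad_id) != 4]

# Wrap the list in a DataFrame with a single column named 'id'
incorrect_id_df = pd.DataFrame({'id': bad_ids})
```

Note that this drops any information about which row or column each ID came from.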


0

Or, if you are not completely sure what you are doing, you can always use the brute-force method :D

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

rows, columns = df.shape

# Visit every cell by position and report IDs whose length is not 4
for row in range(rows):
    for column in range(columns):
        value = df.iloc[row, column]
        if len(value) != 4:
            print("Bad size of ID on row:" + str(row) + " column:" + str(column))
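If the goal is a dictionary rather than printed messages, the same kind of loop can collect the bad IDs per column. This is a rough sketch, assuming a dictionary keyed by column name is acceptable (the question does not say what shape the dictionary should take); bad_by_column is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})

bad_by_column = {}
# DataFrame.items() yields (column name, column Series) pairs
for column, values in df.items():
    bad = [value for value in values if len(value) != 4]
    if bad:  # only record columns that actually contain bad IDs
        bad_by_column[column] = bad
```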


0

As commented by Jon Clements, stack could be useful here – it basically stacks (duh) all columns on top of each other:

>>> df[df.applymap(len) != 4].stack().reset_index(drop=True)
0    111
1    777
dtype: object

To turn that into a single-column df named id, you can chain .rename('id').to_frame() onto it.
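Put together, the whole chain might look like the sketch below. Two details are my own additions rather than part of the answer: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so the code picks whichever exists, and the explicit .dropna() is a defensive step because whether stack() drops missing values by default has changed between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})

# Use DataFrame.map where available (pandas >= 2.1), else fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap
mask = elementwise(len) != 4  # True where the ID length is wrong

incorrect_id_df = (
    df[mask]                  # valid IDs become NaN
    .stack()                  # one long Series
    .dropna()                 # defensive: make sure the NaN entries are gone
    .reset_index(drop=True)
    .rename('id')
    .to_frame()
)
```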
