Create a new DataFrame based on the string lengths of values in an existing DataFrame

Question

Sorry if the title is unclear - I wasn't too sure how to word it. So I have a dataframe that has two columns for old IDs and new IDs.

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

I'm trying to figure out a way to check the string length of every value in both columns and return any IDs that don't match the required string length of 4 in a new dataframe. This will eventually turn into a dictionary of incorrect IDs.

This is the approach I'm currently taking:

incorrect_id_df = df[df.applymap(lambda x: len(x) != 4)]

and the current output:

old_id new_id
 111    NaN
 NaN    NaN
 NaN    777
 NaN    NaN

I'm not sure where to go from here, and I'm sure there's a much better approach. This is the output I'm looking for: a single-column dataframe, with the column named id, containing just the IDs that don't match the required string length:

 id
 111
 777

Can you just .stack() that dataframe and use its .values attribute as the list of invalid IDs? The stacked result still keeps a reference to which column each ID was found in.
– Jon Clements, commented Jun 29, 2022 at 20:19

Rodalm · Accepted Answer · 2022-06-29 20:37:05Z

2

In general, DataFrame.applymap is pretty slow, so you should avoid it. I would stack both columns into a single one and select the IDs whose length is not 4:

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

ids = df.stack()
bad_ids = ids[ids.str.len() != 4]

Output:

>>> bad_ids

0  old_id    111
2  new_id    777
dtype: object

The advantage of this approach is that you now have the location (row label and column name) of each bad ID, which might be useful later. If you don't need it, you can discard it with ids = df.stack().reset_index(drop=True).
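Since the question mentions eventually building a dictionary of incorrect IDs, one possible next step is to convert the filtered Series directly. This is only a minimal sketch: the name bad_id_locations is made up for illustration, and keying the dictionary by (row label, column name) is an assumption about the shape the dictionary should take.

```python
import pandas as pd

df = pd.DataFrame({
    "old_id": ["111", "2222", "3333", "4444"],
    "new_id": ["5555", "6666", "777", "8888"],
})

# Stack both columns into one Series indexed by (row label, column name)
ids = df.stack()

# Keep only the IDs whose string length is not 4
bad_ids = ids[ids.str.len() != 4]

# Hypothetical dictionary layout: (row label, column name) -> bad ID
bad_id_locations = bad_ids.to_dict()
print(bad_id_locations)
```

Each key is a (row label, column name) tuple, so the bad values can be traced back to their cells in df.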

answered Jun 29, 2022 at 20:37


mitoRibo · Accepted Answer · 2022-06-29 20:24:51Z

0

Here's part of an answer:

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
# Flatten every cell into one 1-D array, then keep the IDs whose length is not 4
all_ids = df.values.flatten()
bad_ids = [bad_id for bad_id in all_ids if len(bad_id) != 4]
bad_ids
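To get from that list to the single-column dataframe the question asks for, the list can be wrapped in a DataFrame constructor. A minimal sketch, assuming the column should be called id as in the question; incorrect_id_df is the name the question already uses, and to_numpy().ravel() is swapped in here as an equivalent way to flatten the values.

```python
import pandas as pd

df = pd.DataFrame({
    "old_id": ["111", "2222", "3333", "4444"],
    "new_id": ["5555", "6666", "777", "8888"],
})

# Flatten all cells row by row into a 1-D array
all_ids = df.to_numpy().ravel()

# Keep only the IDs whose length is not 4
bad_ids = [value for value in all_ids if len(value) != 4]

# Wrap the list in a single-column DataFrame named "id"
incorrect_id_df = pd.DataFrame({"id": bad_ids})
print(incorrect_id_df)
```

Unlike the stacked Series, this version does not record which row or column each bad ID came from.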

answered Jun 29, 2022 at 20:24


Jakub · Accepted Answer · 2022-06-29 20:33:08Z

0

Or, if you are not completely sure what you are doing, you can always use the brute-force method :D

import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})

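rows, columns = df.shape

for row in range(rows):
    # Select the row by position so this also works with a non-default index
    row_values = df.iloc[row]
    for column in range(columns):
        # Report every cell whose ID does not have exactly 4 characters
        if len(row_values.iloc[column]) != 4:
            print("Bad size of ID on row:" + str(row) + " column:" + str(column))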
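If the bad IDs are needed afterwards rather than just printed, the same explicit loop can collect them into a list of records. This is a rough sketch of that variation, not the answer's own code: the names records and bad_id_report and the row/column/id layout are made up for illustration, and it iterates column by column with DataFrame.items() instead of by position.

```python
import pandas as pd

df = pd.DataFrame({
    "old_id": ["111", "2222", "3333", "4444"],
    "new_id": ["5555", "6666", "777", "8888"],
})

records = []
# Walk every column, then every (row label, value) pair in that column
for column_name, column in df.items():
    for row_label, value in column.items():
        if len(value) != 4:
            records.append({"row": row_label, "column": column_name, "id": value})

# Hypothetical report layout: one line per bad ID with its location
bad_id_report = pd.DataFrame(records, columns=["row", "column", "id"])
print(bad_id_report)
```

Passing columns explicitly keeps the report's layout stable even when no bad IDs are found.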

answered Jun 29, 2022 at 20:33

Jakub

1949 bronze badges

Comments

fsimonjetz · Accepted Answer · 2022-06-29 20:34:32Z

0

As commented by Jon Clements, stack could be useful here – it basically stacks (duh) all columns on top of each other:

>>> df[df.applymap(len) != 4].stack().reset_index(drop=True)
0    111
1    777
dtype: object

To turn that into a single-column df named id, you can append .rename('id').to_frame() to the chain.
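Put together, the whole chain might look like the sketch below. Note one substitution: newer pandas releases deprecate DataFrame.applymap in favor of DataFrame.map, so this version filters the stacked Series with .str.len() instead of calling applymap. The variable name stacked is made up for illustration; incorrect_id_df is the name used in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "old_id": ["111", "2222", "3333", "4444"],
    "new_id": ["5555", "6666", "777", "8888"],
})

# Stack all columns on top of each other into one Series
stacked = df.stack()

incorrect_id_df = (
    stacked[stacked.str.len() != 4]  # keep IDs whose length is not 4
    .reset_index(drop=True)          # discard the (row, column) index
    .rename("id")                    # name the Series "id"
    .to_frame()                      # convert it to a one-column DataFrame
)
print(incorrect_id_df)
```

The result has a single column named id holding 111 and 777, which matches the output requested in the question.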

edited Jun 29, 2022 at 20:34

answered Jun 29, 2022 at 20:29
