Pandas columns variable length string slicing based on conditions

Question

I have a data frame df like this,

    A           length
0   648702831   9
1    26533315   8
2         366   3
3   354701058   9
4    25708239   8
5       70554   5
6     1574512   7
7        3975   4

Now, I want to create a column based on some conditions like this,

if ['length'] == 9 or ['length'] == 5:
   then ['new_col'] = First 5 Characters of ['A']

else if ['length'] == 8 or ['length'] == 4:
   then ['new_col'] = "0" & First 4 Characters of ['A']

else if ['length'] == 7 or ['length'] == 3:
   then ['new_col'] = "00" & First 3 Characters of ['A']

else
   ['new_col'] = ['A']

To implement these conditions, I wrote the following loop (for a file with 10,000 rows, it takes a long time):

for i in df['length']:

    if i == 9 or i == 5:
        df['new_col'] = df['A'].astype(str).str[:5]
    elif i == 8 or i == 4:
        df['new_col'] = "0" + df['A'].astype(str).str[:4]

    elif i == 7 or i == 3:
        df['new_col'] = "00" + df['A'].astype(str).str[:3]

    else:
        df['new_col'] = df['A']

I get the following output,

    A           length   new_col
0   648702831   9        06487
1    26533315   8        02653
2         366   3        0366
3   354701058   9        03547
4     5708239   8        05708
5       70554   5        07055
6     1574512   7        01574
7        3975   4        03975

This is not what I want. It seems to apply only the second condition, which adds "0" in front when the length is 8 or 4.

I need my output like this,

    A           length   new_col
0   648702831   9        64870
1    26533315   8        02653
2         366   3        00366
3   354701058   9        35470
4     5708239   8        05708
5       70554   5        70554
6     1574512   7        00157
7        3975   4        03975

How can I achieve this? If there is a pandas way that takes less time, that would be great. Any suggestions would be appreciated.

cs95 · Accepted Answer · 2018-12-18 17:15:57Z

3

Use string slicing with zfill. For speed, use a list comprehension.

m = {1: 5, 0: 4, 3: 3}
df['new_col'] = [
    x[:m.get(y % 4, 4)].zfill(5) for x, y in zip(df['A'].astype(str), df['length'])]

df
           A  length new_col
0  648702831       9   64870
1   26533315       8   02653
2        366       3   00366
3  354701058       9   35470
4   25708239       8   02570
5      70554       5   70554
6    1574512       7   00157
7       3975       4   03975
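
The keys of m are the remainders of the length modulo 4, which is how three entries cover all six lengths from the question. A small sketch (the names question_lengths, widths and also_matched are only for illustration) that shows the slice width each length ends up with, and which other lengths share the same remainders:

```python
# Slice widths keyed by length % 4, as in the answer above.
m = {1: 5, 0: 4, 3: 3}

# The six lengths named in the question's conditions.
question_lengths = [9, 5, 8, 4, 7, 3]

# Slice width selected for each of those lengths.
widths = {n: m[n % 4] for n in question_lengths}
print(widths)  # {9: 5, 5: 5, 8: 4, 4: 4, 7: 3, 3: 3}

# Caveat: lengths outside the question's list can share these remainders.
also_matched = [n for n in (1, 2, 6, 10, 11, 12, 13) if n % 4 in m]
print(also_matched)  # [1, 11, 12, 13]
```

If lengths such as 1 or 12 can occur in the real data and should fall through to the default case, a mapping keyed by the length itself avoids the overlap.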

To handle the default case, we can implement a little extra checking when calling zfill:

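# DataFrame.append was removed in pandas 2.0; pd.concat adds the extra row instead.
df = pd.concat([df, pd.DataFrame([{'A': 50, 'length': 2}])], ignore_index=True)

m = {1: 5, 0: 4, 3: 3}

df['new_col'] = [
    # m.get returns None for unmatched lengths, so x[:None] keeps the whole string.
    x[:m.get(y % 4)].zfill(5 if y % 4 in m else 0)
    for x, y in zip(df['A'].astype(str), df['length'])
]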

df
           A  length new_col
0  648702831       9   64870
1   26533315       8   02653
2        366       3   00366
3  354701058       9   35470
4   25708239       8   02570
5      70554       5   70554
6    1574512       7   00157
7       3975       4   03975
8         50       2      50   # Default case.

edited Dec 18, 2018 at 17:15

answered Dec 18, 2018 at 16:30


4 Comments

user9431057 Over a year ago

I am getting an AttributeError: 'DataFrame' object has no attribute 'length' with my original dataframe. With my smaller test data frame this works. I googled it and found that I can use df[column].length, but no luck yet. Any idea why?

cs95 Over a year ago

@user9431057 Can you tell me the output of df.columns?

user9431057 Over a year ago

this is what I get: Index(['A', 'length', 'new_col'], dtype='object')

cs95 Over a year ago

@user9431057 I've edited. What happens if you try to access using df['length']?

jpp · Accepted Answer · 2018-12-18 16:34:30Z

3

You can use a list comprehension with a dictionary. This is perfectly acceptable considering Pandas str methods are not vectorised.

d = {5: 5, 9: 5, 8: 4, 4: 4, 3: 3, 7: 3}

zipper = zip(df['A'].astype(str), df['length'])

df['new_col'] = [A[:d[L]].zfill(5) if L in d else A for A, L in zipper]

print(df)

           A  length new_col
0  648702831       9   64870
1   26533315       8   02653
2        366       3   00366
3  354701058       9   35470
4   25708239       8   02570
5      70554       5   70554
6    1574512       7   00157
7       3975       4   03975
8         12       2      12
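
For comparison, the same conditions can also be written with the .str accessor and numpy.select, which builds each candidate slice for the whole column and then picks one per row. This is only a sketch, assuming numpy is available and using the sample data from this answer; it is not necessarily faster than the list comprehension, because the .str slices may still loop over the values in Python.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [648702831, 26533315, 366, 354701058, 25708239, 70554, 1574512, 3975, 12],
    'length': [9, 8, 3, 9, 8, 5, 7, 4, 2],
})

s = df['A'].astype(str)

# One boolean mask per branch of the question's if/elif chain.
conditions = [
    df['length'].isin([9, 5]),
    df['length'].isin([8, 4]),
    df['length'].isin([7, 3]),
]
# The value to use where the matching mask is True.
choices = [
    s.str[:5],
    '0' + s.str[:4],
    '00' + s.str[:3],
]
# Rows matching no mask keep the original string.
df['new_col'] = np.select(conditions, choices, default=s)
print(df)
```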

answered Dec 18, 2018 at 16:34


BENY · Accepted Answer · 2018-12-18 16:36:50Z

3

Fix your code

df['A'] = df['A'].astype(str)  # convert once, before the loop
df['new_col'] = ''
for i, j in zip(df['length'], df.index):
    # Set one cell per iteration instead of reassigning the whole column.
    if i == 9 or i == 5:
        df.loc[j, 'new_col'] = df.loc[j, 'A'][:5]
    elif i == 8 or i == 4:
        df.loc[j, 'new_col'] = "0" + df.loc[j, 'A'][:4]
    elif i == 7 or i == 3:
        df.loc[j, 'new_col'] = "00" + df.loc[j, 'A'][:3]
    else:
        df.loc[j, 'new_col'] = df.loc[j, 'A']


df
Out[52]: 
           A  length new_col
0  648702831       9   64870
1   26533315       8   02653
2        366       3   00366
3  354701058       9   35470
4   25708239       8   02570
5      70554       5   70554
6    1574512       7   00157
7       3975       4   03975

answered Dec 18, 2018 at 16:36


2 Comments

user9431057 Over a year ago

Thanks for the post. Why did we zip here? Is it to make it faster than the way I have it?

BENY Over a year ago

@user9431057 Your version replaces the whole column on every iteration. Here, zip pairs each length with its index label, so each pass of the loop sets a single cell rather than the whole column. That is why you get your "wrong" output: you overwrite the whole column each time, so the final new_col equals "0" + df['A'].astype(str).str[:4].
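
The overwrite described in the comment above can be reproduced on a tiny frame. In this sketch (the two-row frame is made up for illustration), every pass of the loop assigns the whole column, so only the branch taken on the last iteration survives:

```python
import pandas as pd

# Hypothetical two-row frame, only for illustration.
df = pd.DataFrame({'A': [648702831, 3975], 'length': [9, 4]})

for i in df['length']:
    if i == 9 or i == 5:
        # Assigns every row, not just the row where length == 9.
        df['new_col'] = df['A'].astype(str).str[:5]
    elif i == 8 or i == 4:
        # Runs last here (length == 4) and replaces the whole column again.
        df['new_col'] = "0" + df['A'].astype(str).str[:4]

print(df['new_col'].tolist())  # ['06487', '03975']
```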

Matt W. · Accepted Answer · 2018-12-18 16:32:38Z

0

You can do it using a lambda function:

import pandas as pd

df = pd.DataFrame({'A': [298347, 9287384, 983, 9283, 894, 1]})
# Right-align each value in 8 characters, padding with "0" on the left.
df['new_col'] = df['A'].apply(lambda x: '{0:0>8}'.format(x))

         A   new_col
0   298347  00298347
1  9287384  09287384
2      983  00000983
3     9283  00009283
4      894  00000894
5        1  00000001

answered Dec 18, 2018 at 16:32


1 Comment

user9431057 Over a year ago

Thanks for the post, but I need something like what I posted above. I need to be able to add one "0" or two "00"s based on the condition.
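
The format-spec padding from this answer can be combined with a slice to meet that requirement. A rough sketch, assuming an explicit length-to-width mapping; the names width and pad_prefix are hypothetical helpers, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [648702831, 26533315, 366, 70554, 1574512, 3975, 12],
    'length': [9, 8, 3, 5, 7, 4, 2],
})

# Hypothetical mapping: characters to keep for each length in the question.
width = {9: 5, 5: 5, 8: 4, 4: 4, 7: 3, 3: 3}

def pad_prefix(value, length):
    text = str(value)
    if length not in width:
        return text  # default case: leave the value unchanged
    # Keep the leading characters, then right-align in 5 places filled with "0".
    return '{0:0>5}'.format(text[:width[length]])

df['new_col'] = [pad_prefix(a, n) for a, n in zip(df['A'], df['length'])]
print(df['new_col'].tolist())
# ['64870', '02653', '00366', '70554', '00157', '03975', '12']
```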
