Python regular expression in a column of a dataframe

Question

I have written a Python script for Excel automation and am stuck at one point. I want to apply a regular expression to a column of a dataframe. I have tried many approaches but cannot produce exactly the result I want. My dataframe looks like the following (short sample):

This is a sample; the real dataframe has a large number of columns. I want to apply a regular expression to column C, named ID. I want to split the data in this column on the $ and & separators, but also ignore (delete) all values between * and & or between * and $. Rows with an empty cell in column C (ID) can be deleted or ignored. The following is an example of the output dataframe that I want:

I have tried the following:

import pandas as pd
import re
df = pd.read_excel("Deal Id Part Comparison Master File.xlsx", "Data Dump", header=1)
splits= []
for i in df['ID']:
    s = str(i)
    splits.append(re.split(r'\$|&', s))  # raw string avoids invalid-escape warnings

print(f' final list {splits}')

The code above splits the data on $ and & and stores the pieces in a list, but the data between * and $ or between * and & is not ignored. I also want to explode the data.

I am sure there is a one-liner for this task, but I have not been able to produce the final output.

@Ch3steR But I want to ignore the data between * and & or between * and $, so this will not give the exact result.
– Vishav Gupta, Commented Jan 4, 2021 at 14:40

Wiktor Stribiżew · Accepted Answer · 2021-01-04 15:13:59Z

You can use

import pandas as pd
df = pd.DataFrame({'Order': ['10-112','10-115'], 'Owner':['shubhman', 'rishab'], 'ID':['89ab$cd&78','']})

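# Remove substrings between * and $ or &, then split on $ or &.
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default.
df['ID'] = df['ID'].str.replace(r'\*[^&$]*[&$]', '', regex=True).str.split(r'[$&]')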
df = df.explode('ID') # Split the rows with multiple IDs into multiple rows
df = df[df['ID'].astype(bool)] # Discard the rows with an empty ID
>>> df
    Order     Owner    ID
0  10-112  shubhman  89ab
0  10-112  shubhman    cd
0  10-112  shubhman    78

The regexps here match:

.replace(r'\*[^&$]*[&$]', '') - replaces every substring that starts at * (matched with \*) and runs through the closest following & or $ (whichever comes first), including that delimiter, with an empty string.
.str.split(r'[$&]') - splits on either a $ or & char (note you do not need to escape either inside a character class).
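To see what the two patterns do on their own, here is a minimal sketch using only the standard re module. The sample string is made up, since the question's real data is not shown. The last step is an assumed variant, not part of the answer: it uses a lookahead so the closing delimiter is kept and still separates the neighbouring values.

```python
import re

# Hypothetical sample value; the question's real data is not shown.
sample = "89ab*xx$cd&78*yy&ef"

# Step 1 (pattern from the answer): drop each run that starts at '*' and
# ends at the next '&' or '$'. The closing delimiter is removed as well.
cleaned = re.sub(r"\*[^&$]*[&$]", "", sample)
print(cleaned)  # 89abcd&78ef

# Step 2 (pattern from the answer): split what is left on '$' or '&'.
parts = re.split(r"[$&]", cleaned)
print(parts)  # ['89abcd', '78ef']

# Assumed variant: a lookahead leaves the closing delimiter in place,
# so the values on either side of a removed run stay separate.
kept = re.split(r"[$&]", re.sub(r"\*[^&$]*(?=[&$])", "", sample))
print(kept)  # ['89ab', 'cd', '78', 'ef']
```

Which of the two behaviours is wanted depends on the expected output, which the question does not include in text form.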

2 Comments

Vishav Gupta Over a year ago

@Wiktor Stribiżew Thank you for your answer. This works well for me.

Vishav Gupta Over a year ago

@Ch3steR Thanks for your help as well; you also did your best to provide a solution.

Anthony · Accepted Answer · 2021-01-04 14:50:53Z

I would need to know more about the data to be safe, but when dealing with DataFrames you should think in terms of vectorized operations first and loops second.

Look into the string accessor methods.

pandas.Series.str.replace to delete values between * and $ or &

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html

pandas.Series.str.split

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html?highlight=str%20split#pandas.Series.str.split

pandas.wide_to_long

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html#pandas.wide_to_long

pandas.melt

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html#pandas.melt
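As a rough illustration of the vectorized approach, the sketch below chains the string accessor methods instead of looping. The frame is hypothetical: the column names follow the question, but the values are made up, and the regular expression is borrowed from the other answer. It assumes empty Excel cells arrive as NaN, which is how pandas.read_excel normally reads them, so the rows are dropped with dropna rather than a truthiness check.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column names follow the question, values are made up.
df = pd.DataFrame({
    "Order": ["10-112", "10-115", "10-118"],
    "Owner": ["shubhman", "rishab", "anil"],
    "ID": ["89ab$cd&78", np.nan, "12*zz&34$56"],
})

ids = (
    df["ID"]
    # Drop each '*...' run up to and including the next '&' or '$'.
    .str.replace(r"\*[^&$]*[&$]", "", regex=True)
    # Split on '$' or '&'; a multi-character pattern is treated as a regex.
    .str.split(r"[$&]")
)

out = (
    df.assign(ID=ids)
    .explode("ID")          # one row per ID value
    .dropna(subset=["ID"])  # rows whose ID cell was empty (NaN)
)
out = out[out["ID"] != ""].reset_index(drop=True)  # drop empty-string pieces
print(out)
```

NaN values pass through the .str methods unchanged, which is why the empty row survives until the dropna step.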
