
I have written a Python script for Excel automation and I am stuck at one point. I want to apply a regular expression to a column of a DataFrame. I have tried many approaches but cannot produce exactly the result I want. My DataFrame looks like the following (short sample):

this is input

This is a sample DataFrame; the real one has a large number of columns. I want to apply a regular expression to column C, named ID. I want to split the data in this column on the $ and & separators, but also ignore (delete) all values between * and & or between * and $. Rows with an empty cell in column C (ID) can be deleted or ignored. The following is an example of the output DataFrame I want:

this is output

I have tried the following:

import pandas as pd
import re
df = pd.read_excel("Deal Id Part Comparison Master File.xlsx", "Data Dump", header=1)
splits= []
for i in df['ID']:
    s = str(i)
    splits.append(re.split(r'[$&]', s))  # raw string; split on either $ or &

print(f' final list {splits}')

The code above splits the data on $ and & and stores the pieces in a list, but the data between * and $ or between * and & is not ignored. I also want to explode the data into separate rows.

I am sure there is a one-liner for this task, but I have not been able to produce the final output.

  • @Ch3steR But I want to ignore the data between * and & or between * and $, so this will not give the exact result. Commented Jan 4, 2021 at 14:40
  • My bad, I missed that point. Added an answer. Commented Jan 4, 2021 at 14:54

2 Answers


You can use

import pandas as pd
df = pd.DataFrame({'Order': ['10-112','10-115'], 'Owner':['shubhman', 'rishab'], 'ID':['89ab$cd&78','']})

df['ID'] = df['ID'].str.replace(r'\*[^&$]*[&$]', '', regex=True).str.split(r'[$&]') # Remove substrings from * up to the next $ or & (regex=True is required in pandas 2.0+), then split on $ or &
df = df.explode('ID') # Split the rows with multiple IDs into multiple rows
df = df[df['ID'].astype(bool)] # Discard the rows with an empty ID
>>> df
    Order     Owner    ID
0  10-112  shubhman  89ab
0  10-112  shubhman    cd
0  10-112  shubhman    78

The regexps here match:

  • .str.replace with the pattern r'\*[^&$]*[&$]' - removes every substring that starts at * (matched with \*) and runs up to and including the nearest following & or $ (whichever comes first), see the regex demo
  • .str.split(r'[$&]') - splits on either the $ or the & character (note that neither needs escaping inside a character class).
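
To see what the two patterns do on a value that actually contains *, here is a small sketch using the plain re module. The sample string is made up (the question's real data is not shown), so treat it as an illustration only. It also shows one thing worth checking against your own data: because the closing $ or & is part of the match, the text before * ends up joined to the text after it. If the separator should survive, a lookahead variant is one possible alternative; that variant is my assumption, not part of the answer above.

```python
import re

# Hypothetical ID value, invented for illustration only.
sample = "89ab*xx$cd&78*yy&ef"

# Pattern from the answer: drop everything from * up to AND INCLUDING
# the next $ or &. The separator is consumed, so neighbours are joined.
cleaned = re.sub(r"\*[^&$]*[&$]", "", sample)
parts = re.split(r"[$&]", cleaned)
print(cleaned)  # 89abcd&78ef
print(parts)    # ['89abcd', '78ef']

# Assumed alternative: a lookahead keeps the separator in place, so the
# pieces on both sides of the removed span stay separate.
cleaned_keep = re.sub(r"\*[^&$]*(?=[&$])", "", sample)
parts_keep = re.split(r"[$&]", cleaned_keep)
print(cleaned_keep)  # 89ab$cd&78&ef
print(parts_keep)    # ['89ab', 'cd', '78', 'ef']
```

Which of the two behaviours is right depends on how the * markers sit in the real ID column, so it is worth testing both on a few real rows.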

2 Comments

@Wiktor Stribiżew Thank you for your answer. This works well for me.
@Ch3steR Thanks for your help as well; you also did your best to provide a solution.

I would need to know more about the data to be safe, but you should think in terms of vectors first and loops second when dealing with DataFrames.

Look into string accessor methods.

pandas.Series.str.replace to delete the values between * and $ or &

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html

pandas.Series.str.split

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html?highlight=str%20split#pandas.Series.str.split

pandas.wide_to_long

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html#pandas.wide_to_long

pandas.melt

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html#pandas.melt
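
As a rough sketch of how these methods could fit together without an explicit loop: the sample frame below is made up (column names follow the question), the "ID_" prefix and the "part" column are names I picked for illustration, and the removal pattern is borrowed from the other answer, so adjust all of them to the real data.

```python
import pandas as pd

# Hypothetical sample frame; the real sheet has many more columns.
df = pd.DataFrame({
    "Order": ["10-112", "10-115"],
    "Owner": ["shubhman", "rishab"],
    "ID": ["89ab$cd&78", ""],
})

# 1. Vectorised removal of the spans that start at * (pattern assumed
#    from the other answer; it also consumes the closing $ or &).
ids = df["ID"].str.replace(r"\*[^&$]*[&$]", "", regex=True)

# 2. Split into one column per piece: ID_0, ID_1, ID_2, ...
pieces = ids.str.split(r"[$&]", expand=True).add_prefix("ID_")

# 3. Put the pieces next to the other columns, then melt wide -> long.
wide = pd.concat([df[["Order", "Owner"]], pieces], axis=1)
long = wide.melt(
    id_vars=["Order", "Owner"],
    value_vars=list(pieces.columns),
    var_name="part",  # illustrative helper column, dropped below
    value_name="ID",
)

# 4. Drop the padding (missing values) and the empty IDs.
long = long.dropna(subset=["ID"])
result = long[long["ID"] != ""].drop(columns="part").reset_index(drop=True)
print(result)
```

With many ID pieces per row, DataFrame.explode on the split lists is usually shorter than split plus melt, but the melt route keeps the piece position available if that is needed.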
