
How can I replace the values of an existing dataframe column with the values from the re.search loop?

This is my re.search loop.

import re

for i in dataset['col1']:
    clean = re.search(r'(nan|[0-9]{1,4})([,.][0-9]{1,4})?', i)
    print(clean.group())

This is the sample data set (dataset)

    year    col1
1    2001    10.563\D
2    2002    9.540\A
3    2003    4.674\G
4    2004    3.2754\u
5    2005    nan\x
  • What is your expected output? Commented Apr 1, 2020 at 6:57
  • year col1 1 2001 10.563 2 2002 9.540 3 2003 4.674 4 2004 3.2754 5 2005 nan Commented Apr 1, 2020 at 6:58
  • basically remove the \ and the letters :) Commented Apr 1, 2020 at 6:59

4 Answers


You can use Series.apply to apply a custom function to dataset["col1"]. Or, better, you can use Series.str.replace to replace the pattern with a replacement string.

Try this:

import re

def func(i):
    clean = re.search(r'(nan|[0-9]{1,4})([,.][0-9]{1,4})?', i)
    return clean.group()

dataset["col1"] = dataset["col1"].apply(func)

Or, better:

dataset["col1"] = dataset["col1"].str.replace(r'(.*?)(\\.*?$)', r"\1", regex=True)

Output:

>>> print(dataset)

   year    col1
0  2001  10.563
1  2002   9.540
2  2003   4.674
3  2004  3.2754
4  2005     nan
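For reference, here is a self-contained sketch of both approaches, with the sample frame rebuilt from the question (the backslash suffixes are written as raw-string literals, and `regex=True` is passed explicitly since recent pandas versions no longer treat the pattern as a regex by default):

```python
import re
import pandas as pd

# Sample data mirroring the question
dataset = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004, 2005],
    "col1": [r"10.563\D", r"9.540\A", r"4.674\G", r"3.2754\u", r"nan\x"],
})

# Approach 1: apply re.search row by row
pattern = re.compile(r"(nan|[0-9]{1,4})([,.][0-9]{1,4})?")
cleaned_apply = dataset["col1"].apply(lambda s: pattern.search(s).group())

# Approach 2: vectorised str.replace, dropping the backslash suffix
cleaned_replace = dataset["col1"].str.replace(r"(.*?)(\\.*?$)", r"\1", regex=True)

print(cleaned_apply.tolist())    # ['10.563', '9.540', '4.674', '3.2754', 'nan']
print(cleaned_replace.tolist())  # same result
```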

2 Comments

Thanks a lot! I want to understand the second code you shared. Why did you replace it with \1? Or did I understand the code correctly?
@Lara \1 is just a way to reference the first capturing group in the pattern, which is (.*?) and represents the data you want to keep.

Using your method:

dataset["col1"] = dataset["col1"].apply(lambda x: re.search(r'(nan|[0-9]{1,4})([,.][0-9]{1,4})?', x).group())

though personally, I would do this instead:

dataset["col1"] = dataset["col1"].str[:-2]
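The slicing shortcut only works because, in the sample data, every value ends in exactly two junk characters (a backslash plus one letter). A minimal sketch under that assumption:

```python
import pandas as pd

# Assumes every value ends in exactly two characters to drop ("\" + letter)
col = pd.Series([r"10.563\D", r"9.540\A", r"nan\x"])
print(col.str[:-2].tolist())  # ['10.563', '9.540', 'nan']
```

If the suffix length can vary, one of the regex- or split-based answers is safer.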

Comments


You can use pandas str.extract with a lookahead assertion; it keeps only the characters before the '\':

df['cleaned'] = df["col1"].str.extract(r'(.*(?=\\))')

     year   col1        cleaned
1   2001    10.563\D    10.563
2   2002    9.540\A     9.540
3   2003    4.674\G     4.674
4   2004    3.2754\u    3.2754
5   2005    nan\x       nan

Comments


I would use the split function rather than a longer regular-expression pattern in this case:

dataset['col1'] = dataset['col1'].str.split('\\').str[0]

or, to split into float data type:

dataset['col1'] = dataset['col1'].str.split('\\').str[0].astype(float)

This transforms the values in place and is not error-prone: it always takes the first element of the split result, so values without a backslash pass through unchanged.

Result:

   year    col1
0  2001  10.563
1  2002   9.540
2  2003   4.674
3  2004  3.2754
4  2005     nan
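A runnable sketch of the split-then-convert step, with the sample frame rebuilt from the question; note that the string "nan" conveniently parses to a real float NaN:

```python
import math
import pandas as pd

dataset = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004, 2005],
    "col1": [r"10.563\D", r"9.540\A", r"4.674\G", r"3.2754\u", r"nan\x"],
})

# Keep the text before the literal backslash, then convert to float
dataset["col1"] = dataset["col1"].str.split("\\").str[0].astype(float)
print(dataset["col1"].tolist())  # [10.563, 9.54, 4.674, 3.2754, nan]
```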

3 Comments

Thanks for this! I can also use this, but how do I make the str into float after splitting?
@lara, if this worked for you, please let me know. Also, don't forget to appreciate the effort from others, either by upvoting informative and helpful answers and/or accepting the answer that solved your issue.
Yes, just tried this now and it worked fine. Thanks for the help! :)
