
How can I replace the values of an existing dataframe column with the values from the re.search loop?

This is my re.search loop.

import re

for i in dataset['col1']:
    clean = re.search(r'(nan|[0-9]{1,4})([,.][0-9]{1,4})?', i)
    print(clean.group())

This is the sample data set (dataset)

    year    col1
1    2001    10.563\D
2    2002    9.540\A
3    2003    4.674\G
4    2004    3.2754\u
5    2005    nan\x
  • What is your expected output? Commented Apr 1, 2020 at 6:57
  • year col1 1 2001 10.563 2 2002 9.540 3 2003 4.674 4 2004 3.2754 5 2005 nan Commented Apr 1, 2020 at 6:58
  • basically remove the \ and the letters :) Commented Apr 1, 2020 at 6:59

4 Answers


You can use Series.apply to apply a custom function to dataset["col1"]. Or, better, you can use Series.str.replace to replace the pattern with a replacement string.

Try this:

import re

def func(i):
    clean = re.search(r'(nan|[0-9]{1,4})([,.][0-9]{1,4})?', i)
    return clean.group()

dataset["col1"] = dataset["col1"].apply(func)

Or, better:

dataset["col1"] = dataset["col1"].str.replace(r'(.*?)(\\.*?$)', r"\1", regex=True)

Output:

>>> print(dataset)

   year    col1
0  2001  10.563
1  2002   9.540
2  2003   4.674
3  2004  3.2754
4  2005     nan
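For reference, here is a self-contained sketch of both approaches, with the sample frame rebuilt from the question (the backslash suffixes are written as raw-string literals, and `regex=True` is passed explicitly since recent pandas versions no longer treat the pattern as a regex by default):

```python
import re
import pandas as pd

# Sample data mirroring the question
dataset = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004, 2005],
    "col1": [r"10.563\D", r"9.540\A", r"4.674\G", r"3.2754\u", r"nan\x"],
})

# Approach 1: apply re.search row by row
pattern = re.compile(r"(nan|[0-9]{1,4})([,.][0-9]{1,4})?")
cleaned_apply = dataset["col1"].apply(lambda s: pattern.search(s).group())

# Approach 2: vectorised str.replace, dropping the backslash suffix
cleaned_replace = dataset["col1"].str.replace(r"(.*?)(\\.*?$)", r"\1", regex=True)

print(cleaned_apply.tolist())    # ['10.563', '9.540', '4.674', '3.2754', 'nan']
print(cleaned_replace.tolist())  # same result
```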

2 Comments

Thanks a lot! I want to understand the second code you shared. Why did you replace it with \1? Or did I understand the code correctly?
@Lara \1 is just a way to reference the first capturing group in the pattern, which is (.*?) and represents the data you want to keep.

Using your method:

dataset["col1"] = dataset["col1"].apply(lambda x: re.search(r'(nan|[0-9]{1,4})([,.][0-9]{1,4})?', x).group())

though personally, I would do this instead:

dataset["col1"] = dataset["col1"].str[:-2]
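The slicing shortcut only works because, in the sample data, every value ends in exactly two junk characters (a backslash plus one letter). A minimal sketch under that assumption:

```python
import pandas as pd

# Assumes every value ends in exactly two characters to drop ("\" + letter)
col = pd.Series([r"10.563\D", r"9.540\A", r"nan\x"])
print(col.str[:-2].tolist())  # ['10.563', '9.540', 'nan']
```

If the suffix length can vary, one of the regex- or split-based answers is safer.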

Comments


You can use pandas str.extract with a lookahead assertion; it keeps only the characters before the '\':

df['cleaned'] = df["col1"].str.extract(r'(.*(?=\\))')

     year   col1        cleaned
1   2001    10.563\D    10.563
2   2002    9.540\A     9.540
3   2003    4.674\G     4.674
4   2004    3.2754\u    3.2754
5   2005    nan\x       nan

Comments


I would use the split function rather than a longer regular-expression pattern in this case:

dataset['col1'] = dataset['col1'].str.split('\\').str[0]

or, to split into float data type:

dataset['col1'] = dataset['col1'].str.split('\\').str[0].astype(float)

This transforms the values in place and is not error-prone: it always takes the first element of the split result, so values without a backslash pass through unchanged.

Result:

   year    col1
0  2001  10.563
1  2002   9.540
2  2003   4.674
3  2004  3.2754
4  2005     nan
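A runnable sketch of the split-then-convert step, with the sample frame rebuilt from the question; note that the string "nan" conveniently parses to a real float NaN:

```python
import math
import pandas as pd

dataset = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004, 2005],
    "col1": [r"10.563\D", r"9.540\A", r"4.674\G", r"3.2754\u", r"nan\x"],
})

# Keep the text before the literal backslash, then convert to float
dataset["col1"] = dataset["col1"].str.split("\\").str[0].astype(float)
print(dataset["col1"].tolist())  # [10.563, 9.54, 4.674, 3.2754, nan]
```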

3 Comments

Thanks for this! I can also use this, but how do I make the str into float after splitting?
@lara, if this worked for you, please let me know. Also, don't forget to appreciate the effort from others, either by upvoting informative and helpful answers and/or accepting the answer that solved your issue.
Yes, just tried this now and it worked fine. Thanks for the help! :)
