Replace '-' by 'E-' in dataframe cell IF the '-' is in the middle of a string

Question

I have a huge dataframe composed of 7 columns. Extract:

45589   664865.0    100000.0    7.62275    -.494     1.60149      100010
...
57205   718888.0    100000.0    8.218463    -1.405-3     1.75137      100010
...
55143   711827.0    100000.0    8.156107    9.8336-3    1.758051      100010

As these values come from an input file, they are currently all of string type, and I would like to convert the whole dataframe to float with:

df = df.astype('float')

However, as you might have noticed in the extract, there are '-' characters hiding in some values. Some mark the whole number as negative, such as -.494, and others mark a negative power of ten, such as 9.8336-3.

I need to replace the latter with 'E-' so that Python understands it is an exponent and can convert the cell to float. Usually, I would use:

df = df.replace('-', 'E-', regex=True)

However, this would also add an E to my negative values. To avoid that, I tried the solution offered here: Replace all a in the middle of string by * using regex

str = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
str = re.sub(r'\BA+\B', r'*', str)

However, this is for one specific string. As my dataframe is quite large, I would like to avoid having to go through each cell.

Is there a combination of the functions replace and re.sub I could use to replace only the '-' that is surrounded by other characters with 'E-'?

Thank you for your help!

SeaBean · Accepted Answer · 2021-06-28 14:04:44Z

2

You can use a regex positive lookbehind and a positive lookahead to assert that the hyphen has a character on both sides, i.e. that it is in the middle of the string, as follows:

df = df.replace(r'\s', '', regex=True)      # remove any unwanted spaces 
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)

Result:

print(df)

        0         1         2         3          4         5       6
0  45589  664865.0  100000.0   7.62275      -.494   1.60149  100010
1  57205  718888.0  100000.0  8.218463  -1.405E-3   1.75137  100010
2  55143  711827.0  100000.0  8.156107  9.8336E-3  1.758051  100010
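For context, here is a minimal end-to-end sketch of the two replace calls followed by the float conversion the question asks for. The small frame and its column names (a, b) are made up for illustration, and one cell is given a leading space on purpose.

```python
import pandas as pd

# Hypothetical sample: every cell is a string, as in the question
df = pd.DataFrame({
    "a": ["7.62275", "8.218463", "8.156107"],
    "b": [" -.494", "-1.405-3", "9.8336-3"],
})

df = df.replace(r"\s", "", regex=True)              # drop stray whitespace first
df = df.replace(r"(?<=.)-(?=.)", "E-", regex=True)  # hyphen with a character on both sides
df = df.astype(float)                               # now every cell parses as a float

print(df)
print(df.dtypes)
```

Both replace calls run over the whole frame at once, so no per-cell loop is needed.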

edited Jun 28, 2021 at 14:04

answered Jun 28, 2021 at 13:07

3 Comments

elle.delle Over a year ago

Hi ! Thanks for your answer. However, it did add a few 'E-' to my negative values so I get the error: "ValueError: could not convert string to float: ' E-.05613'". The weird part is that it partially worked though... Thanks for your input nonetheless

SeaBean Over a year ago

@elle.delle Would there be any leading spaces before the first hyphen?

SeaBean Over a year ago

@elle.delle See my edit to perform cleaning up of unwanted spaces before execution.
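The error reported in the comments can be reproduced on single strings with the standard re module. This is only a sketch to show why a leading space matters; the helper name mark_exponent is made up for illustration.

```python
import re

# Same pattern as in the answer: a hyphen with any character before and after it
MIDDLE_HYPHEN = re.compile(r"(?<=.)-(?=.)")

def mark_exponent(cell):
    """Insert 'E' before every hyphen that is not at the start or end."""
    return MIDDLE_HYPHEN.sub("E-", cell)

print(repr(mark_exponent("-1.405-3")))   # leading sign kept, exponent marked
print(repr(mark_exponent(" -.05613")))   # the space counts as a preceding character

cleaned = re.sub(r"\s", "", " -.05613")  # strip whitespace first
print(repr(mark_exponent(cleaned)))      # sign is at position 0 again, so it is left alone
```

With the leading space, the sign is no longer the first character, so the lookbehind succeeds and the value turns into ' E-.05613', which float conversion rejects.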

daveydave · Accepted Answer · 2021-06-28 13:09:27Z

1

Regular expressions can be expensive. An alternative is to slice the string into its first character and the remaining characters, call replace on the remainder, then recombine it with the first character. I haven't benchmarked this, though! Something like this (applied with df.str_col.apply(f)):

def f(x):
  # Keep the first character untouched so a leading minus sign survives
  first_part = x[0]
  remaining_part = x[1:]
  # Any later '-' marks the exponent
  remaining_part = remaining_part.replace('-', 'E-')
  return first_part + remaining_part

my_str = '-1.23-4'
f(my_str)  # '-1.23E-4'

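Or as a one-liner (assuming the seven columns are the only columns in your df, otherwise specify the columns):

# applymap works element-wise (df.apply would pass whole columns); it is named DataFrame.map in pandas 2.1+
df = df.applymap(lambda x: x[0] + x[1:].replace('-', 'E-'))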
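As a rough sketch of how the slicing idea compares with the lookaround regex from SeaBean's answer, the snippet below runs both on a few sample strings. The function names and the sample list are made up for illustration, and the empty-string guard is an addition not present in the answer.

```python
import re

def fix_by_slicing(cell):
    """Insert 'E' before every hyphen except one at position 0."""
    if not cell:  # guard: cell[0] raises IndexError on an empty string
        return cell
    return cell[0] + cell[1:].replace("-", "E-")

def fix_by_regex(cell):
    """Same idea with the lookbehind/lookahead pattern."""
    return re.sub(r"(?<=.)-(?=.)", "E-", cell)

samples = ["-.494", "-1.405-3", "9.8336-3", "100010", ""]
for s in samples:
    print(repr(s), "->", repr(fix_by_slicing(s)), repr(fix_by_regex(s)))
```

The two agree on these inputs but not on a trailing hyphen: slicing turns '1.5-' into '1.5E-', while the lookaround pattern leaves it unchanged.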

edited Jun 28, 2021 at 13:09

answered Jun 28, 2021 at 13:03

1 Comment

elle.delle Over a year ago

A bit more complex than @Osamoele 's answer, but works too, thanks a lot for your help :)

bruno-uy · Accepted Answer · 2021-06-28 13:10:49Z

1

I tried this example and it worked:

import pandas as pd

df = pd.DataFrame({'A': ['-.494', '-1.405-3', '9.8336-3']})
pat = r"(\d)-"
repl = lambda m: f"{m.group(1)}e-"
df['A'] = df['A'].str.replace(pat, repl, regex=True)
df['A'] = pd.to_numeric(df['A'], errors='coerce')

answered Jun 28, 2021 at 13:10

1 Comment

elle.delle Over a year ago

This works, thanks a lot! The only downside is that it "just" works column by column, but that is still a major improvement over going cell by cell. Thanks for your input, I really appreciate it!
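One possible way around the column-by-column limitation is to hand the same string replacement to DataFrame.apply, which calls it once per column. This is a sketch under the assumption that every column holds strings; the second column B is made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame of strings
df = pd.DataFrame({
    "A": ["-.494", "-1.405-3", "9.8336-3"],
    "B": ["1.60149", "1.75137", "1.758051"],
})

pat = r"(\d)-"
# apply() passes each column (a Series) to the lambda, so .str is available
df = df.apply(lambda col: col.str.replace(pat, r"\1e-", regex=True))
# Convert every column; values that still fail to parse become NaN
df = df.apply(pd.to_numeric, errors="coerce")

print(df)
```

Because errors='coerce' silently produces NaN, it may be worth checking df.isna().sum() afterwards to spot cells that did not parse.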

Osamoele · Accepted Answer · 2021-09-17 20:58:37Z

1

You could use capture groups, as described in this thread, to select the number before your exponent so that:

first: the match only occurs when the minus is preceded by digits
and second: the match is replaced by the digits captured in the group, followed by E- (for example, in 158-3 the digits 158 are re-inserted "dynamically" through \1, the content of group 1, and E- is added "statically").

This gives:

df.replace({r'(\d+)-' : r'\1E-'}, inplace=True, regex=True)

(You can verify it on regexp tester)

edited Sep 17, 2021 at 20:58

answered Jun 28, 2021 at 13:08

5 Comments

bruno-uy Over a year ago

This doesn't work, but this works: df.replace({r'(\d)-' : r'\1E-'}, inplace=True, regex=True)

Osamoele Over a year ago

Thanks, I modified it in the answer.

elle.delle Over a year ago

Worked like a charm, it's possibly the easiest and most elegant answer too, thanks a lot!

SeaBean Over a year ago

@elle.delle This also matches a - at the end of a string, not only in the middle. Use with care!

elle.delle Over a year ago

@SeaBean Ooh thanks for the tip, I'll be careful about that!
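The trailing-hyphen caveat can be checked on plain strings. The sketch below compares the pattern from the comments with a stricter variant that adds a lookahead; the stricter pattern is not from the thread and assumes an exponent always starts with a digit.

```python
import re

loose = re.compile(r"(\d)-")          # pattern suggested in the comments
strict = re.compile(r"(\d)-(?=\d)")   # assumption: the exponent begins with a digit

for cell in ["-.494", "-1.405-3", "9.8336-3", "12-"]:
    print(cell, "|", loose.sub(r"\1E-", cell), "|", strict.sub(r"\1E-", cell))
```

Only the last sample differs: the loose pattern rewrites '12-' to '12E-', while the strict one leaves it as is.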
