3

I have a huge dataframe composed of 7 columns. Extract:

45589   664865.0    100000.0    7.62275    -.494     1.60149      100010
...
57205   718888.0    100000.0    8.218463    -1.405-3     1.75137      100010
...
55143   711827.0    100000.0    8.156107    9.8336-3    1.758051      100010

As these values come from an input file, there are currently all of string type and I would like to change all the dataframe to float through :

df= df.astype('float')

However, as you might have noticed on the extract, there are ' - ' hiding. Some represent the negative value of the whole number, such as -.494 and others represent a negative power, such as 9.8-3.

I need to replace the latters with 'E-' so Python understands it's a power and can switch the cell to a float type. Usually, I would use:

df= df.replace('E\-', '-', regex=True)

However, this would also add an E to my negative values. To avoid that, I tried the solution offered here: Replace all a in the middle of string by * using regex

str = 'JAYANTA POKED AGASTYA WITH BAAAAMBOO '
str = re.sub(r'\BA+\B', r'*', str)

However, this is for one specific string. As my dataframe is quite large, I would like to avoid having to go through each cell.

Is there a combination of the functions replace and re.sub I could use in order to only replace the'-' surrounded by other strings by 'E-'?

Thank you for your help!

4 Answers 4

2

You can use regex negative lookahead and positive lookahead to assert that the hyphen is in the middle for replace, as follows:

df = df.replace(r'\s', '', regex=True)      # remove any unwanted spaces 
df = df.replace(r'(?<=.)-(?=.)', 'E-', regex=True)

Result:

print(df)

        0         1         2         3          4         5       6
0  45589  664865.0  100000.0   7.62275      -.494   1.60149  100010
1  57205  718888.0  100000.0  8.218463  -1.405E-3   1.75137  100010
2  55143  711827.0  100000.0  8.156107  9.8336E-3  1.758051  100010
Sign up to request clarification or add additional context in comments.

3 Comments

Hi ! Thanks for your answer. However, it did add a few 'E-' to my negative values so I get the error: "ValueError: could not convert string to float: ' E-.05613'". The weird part is that it partially worked though... Thanks for your input nonetheless
@elle.delle Would there be any leading spaces before the first hypen ?
@elle.delle See my edit to perform cleaning up of unwanted spaces before execution.
1

Regular expressions can be expensive, perhaps slice the string into the first digit and remaining digits, use replace on the remaining digits, then recombine with the first digit. Haven't benchmarked this though! Something like this (applied with df.str_col.apply(lambda x: f(x))

my_str = '-1.23-4'
def f(x):
  first_part = my_str[0]
  remaining_part = my_str[1:]
  remaining_part = remaining_part.replace('-', 'E-')
  return first_part + remaining_part

Or as a one liner (assuming the seven columns are the only columns in your df, otherwise specify the columns):

df.apply(lambda x: x[0] + x[1:].replace('-', 'E-'))

1 Comment

A bit more complex than @Osamoele 's answer, but works too, thanks a lot for your help :)
1

I tried this example and worked:

import pandas as pd

df = pd.DataFrame({'A': ['-.494', '-1.405-3', '9.8336-3']})
pat = r"(\d)-"
repl = lambda m: f"{m.group(1)}e-"
df['A'] = df['A'].str.replace(pat, repl, regex=True)
df['A'] = pd.to_numeric(df['A'], errors='coerce')

1 Comment

This works, thanks a lot! The only downside is that it "just" does column by column but that is still a major improvement then from cell to cell. Thanks for your input, I really appreciate it!
1

You could use groups as specified in this thread, to select the number before you exponent so that :

  • first : the match only ocurs when the minus is preceded by values
  • and second : replace the match by E preceded by the values matched by the group (for example 158-3 will be replaced "dynamically" by the value 158 matched in group 1, with the expression \1 (group 1 content) and "statically" by E-.

This gives :

df.replace({r'(\d+)-' : r'\1E-'}, inplace=True, regex=True)

(You can verify it on regexp tester)

5 Comments

This doesn't work, but this works: df.replace({r'(\d)-' : r'\1E-'}, inplace=True, regex=True)
Thanks, I modified it in the answer.
Worked like a charm, it's possibly the easiest and most elegant answer too, thanks a lot!
@elle.delle This also matches - at the end of strings, instead of at the middle. Use with care!
@SeaBean Ooh thanks for the tip, I'll be careful about that!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.