
There are a lot of similar questions, but I have not found a solution for my problem. I have a data frame with the following structure/form:

   col_1
0  BULKA TARTA 500G KAJO 1
1  CUKIER KRYSZTAL 1KG KSC 4
2  KASZA JĘCZMIENNA 4*100G 2 0.92
3  LEWIATAN MAKARON WSTĄŻKA 1 0.89

However, I want to achieve the effect:

   col_1
0  BULKA TARTA 500G KAJO
1  CUKIER KRYSZTAL 1KG KSC
2  KASZA JĘCZMIENNA 4*100G
3  LEWIATAN MAKARON WSTĄŻKA

So I want to remove the standalone integers and decimals, but keep the numbers that are attached to letters inside a token.

I tried to use df.col_1.str.isdigit().replace([True, False], [np.nan, df.col_1]), but that only tests whether the entire cell is a number or not.

Do you have any ideas how to do this? Or maybe it would be better to split the column on spaces and then compare the tokens?

2 Comments
  • Sounds like you want regular expressions. Commented Nov 13, 2017 at 17:23
  • Updated my answer to also include a regex example. Hope it helped! Commented Nov 13, 2017 at 17:32

3 Answers


We could create a function that tries to convert the token to float; if that fails, it returns True (not a float):

import pandas as pd

df = pd.DataFrame({"col_1" : ["BULKA TARTA 500G KAJO 1",
                              "CUKIER KRYSZTAL 1KG KSC 4",
                              "KASZA JĘCZMIENNA 4*100G 2 0.92",
                              "LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})

def is_not_float(string):
    try:
        float(string)
        return False
    except ValueError:  # String is not a number
        return True

df["col_1"] = df["col_1"].apply(lambda x: [i for i in x.split(" ") if is_not_float(i)])

df

Or, following the example of my fellow SO users (note, however, that this would treat 130. as a number):

df["col_1"] = (df["col_1"].apply(
    lambda x: [i for i in x.split(" ") if not i.replace(".","").isnumeric()]))

Returns

                          col_1
0    [BULKA, TARTA, 500G, KAJO]
1  [CUKIER, KRYSZTAL, 1KG, KSC]
2   [KASZA, JĘCZMIENNA, 4*100G]
3  [LEWIATAN, MAKARON, WSTĄŻKA]
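Since both variants leave lists of tokens in the column, a minimal sketch (reusing the is_not_float helper above, with a shortened sample frame) that rejoins each row into a single string:

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["BULKA TARTA 500G KAJO 1",
                             "CUKIER KRYSZTAL 1KG KSC 4"]})

def is_not_float(string):
    # float() succeeds only for purely numeric tokens
    try:
        float(string)
        return False
    except ValueError:
        return True

# filter out the numeric tokens, then rejoin with spaces so the
# column holds plain strings again instead of lists
df["col_1"] = df["col_1"].apply(
    lambda x: " ".join(i for i in x.split(" ") if is_not_float(i)))
print(df)
```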

3 Comments

This is my favorite solution. And how would I use something like ' '.join(df2[i]) on each row, so that everything is joined back together?
@TomaszPrzemski Sorry, is that a question? What is df2 in this case?
Yes :) I introduced a new variable df2 instead of df["col_1"], and later: h = []; for i in range(len(df2)): h.append(' '.join(df2[i])); g = "\n".join(h); then with open("C:\\Users\dell\\Desktop\\delikatesy\\wyczyszczone\\delikatesy_test1.csv", 'w', encoding='utf-8') as of: of.write(g). I'm just learning, but I think it's easier to do it this way :)

Sure,

You could use a regex. Anchoring the match to whitespace ensures only standalone numbers are removed, so tokens like 500G and 4*100G survive:

df.col_1 = df.col_1.str.replace(r"\s*(?<!\S)\d+(?:\.\d+)?(?!\S)", "", regex=True)
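A quick self-contained check of the whitespace-anchored pattern, using part of the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["KASZA JĘCZMIENNA 4*100G 2 0.92",
                             "LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})

# (?<!\S) and (?!\S) make sure the number is a whole
# whitespace-delimited token, so 4*100G stays untouched
pattern = r"\s*(?<!\S)\d+(?:\.\d+)?(?!\S)"
df["col_1"] = df["col_1"].str.replace(pattern, "", regex=True)
print(df)
```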

Comments


Yes, you can:

def no_nums(col):
    # keep only the words that are not purely numeric
    return ' '.join(word for word in col.split()
                    if not word.replace('.', '').isdigit())

df.col_1 = df.col_1.apply(no_nums)

This filters out the words of each value that consist entirely of digits (possibly with a decimal point).
If you also want to filter out numbers like 1,000, simply add another replace for ','.
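A small sketch of that extension, with the extra ',' replace chained on (the sample string here is made up for illustration):

```python
def no_nums(col):
    # a word is dropped only if, after stripping '.' and ',',
    # nothing but digits remains
    return ' '.join(word for word in col.split()
                    if not word.replace('.', '').replace(',', '').isdigit())

print(no_nums("CENA 1,000 SZT 2.5 KAWA 500G"))
```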

Comments
