I'm trying to make a transition from R to Python. One package that I heavily relied on was the data.table package. I am struggling to replicate this in Py/Pandas or just Python.

Update: included dummy data - thank you @cmaher for the suggestion

import pandas as pd
d = {'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']}
df = pd.DataFrame(data=d)
df

# R solution
library(data.table)
library(stringr)

df <- data.table(id = c(1,2,3), x1=c('1_a', '1_b', 'NX'))

df[str_detect(x1, '\\d') & !str_detect(x1, 'NX'), c("x2", "x3") := tstrsplit(x1, "_", fixed=TRUE)][!str_detect(x1, '\\d'), 'x3' := x1]

df
> df
   id  x1 x2 x3
1:  1 1_a  1  a
2:  2 1_b  1  b
3:  3  NX NA NX

# python-pandas attempt
df['x2'], df['x2'] = df['x1'].apply(
    lambda x: df['x1'].str.split('_', 1).str if (df['x1'].str.contains('\\d')) & 
    ~(df['x1'].str.contains('NX')) else df['x1'])
  • Please read how to make good reproducible pandas examples. Questions such as this one are much more constructive if they include sample data & desired output, rather than just a code chunk to translate. Commented Mar 23, 2018 at 18:10
  • Do you want string to be separated by underscore or want to extract the number part of the string in x2 and string part in x3? Commented Mar 23, 2018 at 20:32
  • split by underscore mainly to do what you mentioned: x2 = number and x3 =string. Commented Mar 23, 2018 at 20:51

2 Answers

As I see in your comments, your intent is to put the numbers in x2 and the strings in x3. The following code may fit your requirements, using the 're' module:

import pandas as pd
import re
d = {'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']}
df = pd.DataFrame(data=d)
print(df)

def findPattern(pattern, string):
    m= re.search(pattern,string)
    if m:
        return m.group()
    else:
        return None

df['x2'] = df.x1.apply(lambda x: findPattern(r"\d+",x)) 
df['x3'] = df.x1.apply(lambda x: findPattern(r"[a-zA-Z]+",x))

print(df)

The output:

   id   x1    x2  x3
0   1  1_a     1   a
1   2  1_b     1   b
2   3   NX  None  NX
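
The same two patterns can probably also be applied without a Python-level apply, using pandas' own string methods. This is only a minimal sketch on the sample data from the question, not a replacement tested on real data; it assumes each value contains at most one run of digits and one run of letters that matter.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']})

# expand=False returns a Series holding the single capture group;
# rows where the pattern does not match get a missing value
df['x2'] = df['x1'].str.extract(r'(\d+)', expand=False)
df['x3'] = df['x1'].str.extract(r'([a-zA-Z]+)', expand=False)

print(df)
```

Like re.search, str.extract keeps only the first match in each string, so the result should line up with the findPattern version for this data.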

So are you looking for something like this?

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']})
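# split on the first '_' only; expand=True returns one column per part
df[['x2', 'x3']] = df['x1'].str.split('_', n=1, expand=True)
# no '_' found: x3 is missing, so fall back to the original string
df.loc[df['x3'].isnull(), 'x3'] = df['x1']
# no '_' found: x2 still equals x1, so blank it out
df['x2'] = df['x2'].where(df['x2'] != df['x1'], np.nan)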
df

out:

   id   x1   x2  x3
0   1  1_a    1   a
1   2  1_b    1   b
2   3   NX  NaN  NX

2 Comments

Sorry, NA is R's equivalent of 'missing data'.
@user2340706 this should work for you. It splits each string in df['x1'] on '_'. The default for df['x3'] is df['x1'], and df['x2'] is null if there is no '_' on which to split.
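
The logic described in that comment can also be written with an explicit boolean mask, which reads a little closer to the conditional assignment in the data.table version. This is a sketch under the assumption that '_' is the only separator of interest; has_sep and parts are names made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'x1': ['1_a', '1_b', 'NX']})

# True for rows that actually contain the separator (literal match, not a regex)
has_sep = df['x1'].str.contains('_', regex=False)

# split every row once on the first '_'; rows without it have a missing second part
parts = df['x1'].str.split('_', n=1, expand=True)

# x2: first part where a split happened, missing otherwise
df['x2'] = parts[0].where(has_sep)
# x3: second part where a split happened, the original string otherwise
df['x3'] = parts[1].where(has_sep, df['x1'])

print(df)
```

Note that parts only has a second column when at least one row contains '_', so this sketch would need a guard for data where no row can be split.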