Slice strings in python pandas DataFrame based on array of integers

Question

I want to slice a column in a dataframe (which contains only strings) based on the integers from a series. Here is an example:

import pandas

data = pandas.DataFrame(['abc','scb','dvb'])
indices = pandas.Series([0,1,0])

Then apply some function so that each string is reduced to the character at its index, giving 'a', 'c', 'd'.

Vaishali · Accepted Answer · 2017-01-25 00:15:12Z


You can use plain Python to manipulate the lists beforehand.

l1 = ['abc','scb','dvb']
l2 = [0,1,0]
l3 = [l1[i][l2[i]] for i in range(len(l1))]

You get l3 as

['a', 'c', 'd']

Now convert it to a DataFrame:

import pandas as pd

data = pd.DataFrame(l3)

This gives the desired DataFrame.
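The same idea can be sketched directly against the question's `data` and `indices` objects, without copying the values into separate lists first. This is only a minimal sketch; `picked` and `result` are names introduced here for illustration, and passing `index=data.index` is an assumption that the original row labels should be kept.

```python
import pandas as pd

data = pd.DataFrame(['abc', 'scb', 'dvb'])
indices = pd.Series([0, 1, 0])

# Pair each string with its position and pick that single character.
picked = [s[i] for s, i in zip(data[0], indices)]

# Keep the original row labels on the result (an assumption, see above).
result = pd.DataFrame(picked, index=data.index)
print(result)
```

Using `zip` avoids indexing both lists by position, which is usually considered more idiomatic than `range(len(...))`.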

answered Jan 25, 2017 at 0:15


3 Comments

MaxU - stand with Ukraine Over a year ago

This is an interesting idea. If you could implement it using numpy, it could be pretty fast...

Vaishali Over a year ago

Not a numpy geek yet but let me try. Thanks for the reply:)

murphycj Over a year ago

Thanks, this seems to be the more generalizable solution. I say that because I have another case where I might want a range slice (i.e. multiple letters for each row in the final output data frame). I could not find a way to adapt the other solution from @MaxU.
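For the range-slice case mentioned in this comment, one possible adaptation of the list-comprehension approach is sketched below. The `starts` and `stops` series are hypothetical inputs made up for illustration; they do not appear in the original question.

```python
import pandas as pd

data = pd.DataFrame(['abc', 'scb', 'dvb'])

# Hypothetical per-row slice bounds (not from the original question).
starts = pd.Series([0, 1, 0])
stops = pd.Series([2, 3, 1])

# s[a:b] is an ordinary Python slice, so each row may have its own width.
sliced = [s[a:b] for s, a, b in zip(data[0], starts, stops)]

result = pd.DataFrame(sliced, index=data.index)
print(result)
```

Because each row is sliced independently, the resulting strings do not need to have the same length.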

MaxU - stand with Ukraine · Accepted Answer · 2017-01-25 00:43:54Z


You can use the following vectorized approach:

In [191]: [tuple(x) for x in indices.reset_index().values]
Out[191]: [(0, 0), (1, 1), (2, 0)]

In [192]: data[0].str.extractall(r'(.)') \
                 .loc[[tuple(x) for x in indices.reset_index().values]]
Out[192]:
         0
  match
0 0      a
1 1      c
2 0      d

In [193]: data[0].str.extractall(r'(.)') \
                 .loc[[tuple(x) for x in indices.reset_index().values]] \
                 .reset_index(level=1, drop=True)
Out[193]:
   0
0  a
1  c
2  d

Explanation:

In [194]: data[0].str.extractall(r'(.)')
Out[194]:
         0
  match
0 0      a
  1      b
  2      c
1 0      s
  1      c
  2      b
2 0      d
  1      v
  2      b

In [195]: data[0].str.extractall(r'(.)').loc[ [ (0,0), (1,1) ] ]
Out[195]:
         0
  match
0 0      a
1 1      c
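As a script rather than an IPython session, the steps explained above might look like the following sketch. The names `chars`, `keys` and `result` are introduced here for illustration only, and `zip` is used in place of `reset_index().values` to build the same (row, position) pairs.

```python
import pandas as pd

data = pd.DataFrame(['abc', 'scb', 'dvb'])
indices = pd.Series([0, 1, 0])

# One row per character, indexed by (original row, match number).
chars = data[0].str.extractall(r'(.)')

# Build the (row, position) keys to look up in that MultiIndex.
keys = list(zip(indices.index, indices))

# Select the keyed rows, then drop the 'match' level to restore the row index.
result = chars.loc[keys].reset_index(level='match', drop=True)
print(result)
```

Note that `extractall` materialises one row per character of every string, so the intermediate frame grows with the total number of characters.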

NumPy solution:

In [259]: a = np.array([list(x) for x in data.values.reshape(1, len(data))[0]])

In [260]: a
Out[260]:
array([['a', 'b', 'c'],
       ['s', 'c', 'b'],
       ['d', 'v', 'b']],
      dtype='<U1')

In [263]: pd.Series(a[np.arange(len(data)), indices])
Out[263]:
0    a
1    c
2    d
dtype: object
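The NumPy steps can also be put together as a standalone sketch. It assumes, as in the example data, that every string has the same length; otherwise the list of character lists would not form a rectangular 2-D array. The names `chars`, `picked` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(['abc', 'scb', 'dvb'])
indices = pd.Series([0, 1, 0])

# Split each string into characters; this yields a 2-D array only
# because every string here has the same length (assumption).
chars = np.array([list(s) for s in data[0]])

# Integer-array indexing pairs row i with column indices[i].
picked = chars[np.arange(len(chars)), indices.to_numpy()]

result = pd.Series(picked, index=data.index)
print(result)
```

Indexing with two equal-length integer arrays selects one element per (row, column) pair, which is why a single expression picks a different column for each row.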

edited Jan 25, 2017 at 0:43

answered Jan 25, 2017 at 0:07


1 Comment

murphycj Over a year ago

Thanks, it runs pretty quickly on the larger dataset I am applying it to as well.
