Using regex to obtain the row index of a partial match within a pandas dataframe column

Question

I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.

# An example of the dataframe and a test doi:

                                    ScID.xml     journal  year    topic1
0    0009-3570(2017)050[0199:omfg]2.3.co.xml  Journal_1   2017  0.000007
1  0001-3568(2001)750[0199:smdhmf]2.3.co.xml  Journal_3   2001  0.000648
2  0002-3568(2004)450[0199:gissaf]2.3.co.xml  Journal_1   2004  0.000003
3  0003-3568(2011)150[0299:easayy]2.3.co.xml  Journal_1   2011  0.000003

# A dummy doi:

test_doi = '0002-3568(2004)450'

In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.

I have searched this site and applied the methods described for similar scenarios.

Including:

df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()

This returns:

AttributeError: 'str' object has no attribute 'contains'

and:

df.filter(regex=test_doi)

gives:

Empty DataFrame

Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[287459 rows x 0 columns]

and finally:

df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]

which also returns the Empty DataFrame as above.

All help is appreciated. Thank you.

See if the filter method works after changing test_doi to escape the paranthesis. i.e. test_doi = '0002-3568\(2004\)450'. Also, see this link for the correct regex usage. — Matt.G
– Matt.G, Commented Apr 26, 2018 at 12:19

P.Tillmann · Accepted Answer · 2018-04-26 12:35:22Z

5

There are two reasons why your first approach does not work:

First - If you use apply on a series the values in the lambda function will not be a series but a scalar. And because contains is a function from pandas and not from strings you get your error message.

Second - Brackets have a special meaning in a regex (the delimit a capture group). If you want the brackets as characters you have to escape them.

test_doi = '0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
                                    ScID.xml    journal  year    topic1
2  0002-3568(2004)450[0199:gissaf]2.3.co.xml  Journal_1  2004  0.000003

Bye the way, pandas filter function filters on the label of the index, not the values.

edited Apr 26, 2018 at 12:35

answered Apr 26, 2018 at 12:28

P.Tillmann

2,12012 silver badges17 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Using regex to obtain the row index of a partial match within a pandas dataframe column

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related