2

I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.

# An example of the dataframe and a test doi:

                                    ScID.xml     journal  year    topic1
0    0009-3570(2017)050[0199:omfg]2.3.co.xml  Journal_1   2017  0.000007
1  0001-3568(2001)750[0199:smdhmf]2.3.co.xml  Journal_3   2001  0.000648
2  0002-3568(2004)450[0199:gissaf]2.3.co.xml  Journal_1   2004  0.000003
3  0003-3568(2011)150[0299:easayy]2.3.co.xml  Journal_1   2011  0.000003

# A dummy doi:

test_doi = '0002-3568(2004)450'

In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.

I have searched this site and applied the methods described for similar scenarios.

Including:

df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()

This returns:

AttributeError: 'str' object has no attribute 'contains'

and:

df.filter(regex=test_doi)

gives:

Empty DataFrame

Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[287459 rows x 0 columns]

and finally:

df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]

which also returns the Empty DataFrame as above.

All help is appreciated. Thank you.

1
  • 1
    See if the filter method works after changing test_doi to escape the paranthesis. i.e. test_doi = '0002-3568\(2004\)450'. Also, see this link for the correct regex usage. Commented Apr 26, 2018 at 12:19

1 Answer 1

5

There are two reasons why your first approach does not work:

First - If you use apply on a series the values in the lambda function will not be a series but a scalar. And because contains is a function from pandas and not from strings you get your error message.

Second - Brackets have a special meaning in a regex (the delimit a capture group). If you want the brackets as characters you have to escape them.

test_doi = '0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
                                    ScID.xml    journal  year    topic1
2  0002-3568(2004)450[0199:gissaf]2.3.co.xml  Journal_1  2004  0.000003

Bye the way, pandas filter function filters on the label of the index, not the values.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.