I have the following SQL query for finding overlaps between begin and end for a particular note_id:

select a.*, b.*
from test.analytical_cui_mipacq_concepts_new a
inner join test.analytical_cui_mipacq_concepts_new b on ( 
    ( b.begin>=a.begin and b.begin<=a.end )
    or
    ( b.begin<=a.begin and b.end>=a.begin )
)
where ((a.system='metamap' and  b.system!=a.system) or (a.system='metamap' and  b.system=a.system and a.id_ != b.id_ and a.note_id = b.note_id))

The query is taking forever and a day to run, so I am trying to convert it to a pandas merge by following this thread: pandas-join-dataframe-with-condition

So far I have come up with the following (new is my original dataframe, note_id is how I identify a particular individual, and id_ is the primary key from the db table):

a = new.copy()
b = new.copy()
b.columns

b = b.rename(index=str, columns={'end':'end_x', 'begin': 'begin_x', 'cui': 'cui_x', 
                                 'old_cui': 'old_cui_x', 'type': 'type_x', 
                                 'polarity': 'polarity_x', 'id_':'id_x'}) 

c = a.merge(b, how='inner', on=['note_id'])

print(len(a), len(b), len(c))
c.loc[(((c.begin >= c.begin_x) & (c.begin <= c.end_x)) 
       | ((c.begin<=b.begin_x) & (c.end>=c.begin_x))) &
      (((c.system=='metamap') &  (c.system!=c.system_x)) 
       | ((c.system_x=='metamap') & (c.system==c.system_x) 
          & (c.id_ != c.id_x) & (c.note_id == c.note_id_x)))]

When I run this, I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-e8c0d060f2a0> in <module>()
     32 print(len(a), len(b), len(c))
     33 c.loc[(((c.begin >= c.begin_x) & (c.begin <= c.end_x)) 
---> 34        | ((c.begin<=b.begin_x) & (c.end>=c.begin_x))) &
     35       (((c.system=='metamap') &  (c.system!=c.system_x)) 
     36        | ((c.system_x=='metamap') & (c.system==c.system_x) 

/anaconda3/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1674 
   1675         elif isinstance(other, ABCSeries) and not self._indexed_same(other):
-> 1676             raise ValueError("Can only compare identically-labeled "
   1677                              "Series objects")
   1678 

ValueError: Can only compare identically-labeled Series objects

I'm not exactly sure what this means, even after Googling around for it.

The data look like:

begin,polarity,end,note_id,type,system,cui,id_
31,1,37,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0004352,1
63,1,71,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0574032,2
81,1,86,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0039869,3
96,1,100,527982345,biomedicus.v2.UmlsConcept,biomedicus,C1123023,4
96,1,105,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0015230,5
101,1,105,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0015230,6
130,1,138,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0574032,7
143,1,144,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0184661,8
156,1,162,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0026591,9
176,1,185,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0004268,10
201,1,209,527982345,biomedicus.v2.UmlsConcept,biomedicus,C0574032,11
101,-1,116,527982345,org.metamap.uima.ts.Candidate,metamap,C0445223,168094
100,-1,116,527982345,org.metamap.uima.ts.Candidate,metamap,C0445223,168095
109,-1,116,527982345,org.metamap.uima.ts.Candidate,metamap,C0445223,168096
124,1,129,527982345,org.metamap.uima.ts.Candidate,metamap,C0205435,168097
124,1,129,527982345,org.metamap.uima.ts.Candidate,metamap,C1279901,168098
130,1,138,527982345,org.metamap.uima.ts.Candidate,metamap,C0574032,168099
130,1,138,527982345,org.metamap.uima.ts.Candidate,metamap,C1827465,168100
143,1,144,527982345,org.metamap.uima.ts.Candidate,metamap,C0021966,168101
143,1,144,527982345,org.metamap.uima.ts.Candidate,metamap,C0221138,168102
31,1,37,527982345,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C0004352,55414
599,1,603,527982345,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C0206655,55415
67,1,73,4069123471-4,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C3263723,55416
646,-1,650,527982345,org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention,ctakes,C0042109,55417
31,1,37,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,,32496
56,1,71,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,C0993666,32497
92,1,105,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,,32498
96,1,100,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,,32499
120,1,129,527982345,edu.uth.clamp.nlp.typesystem.ClampNameEntityUIMA,clamp,C2008415,32500
  • That means the Series a and b have different indexes, and pandas does not define Series comparison in this case. The same error occurs with the test a = pd.Series([1, 2], index=[0, 1]); b = pd.Series([1, 2], index=[0, 2]); a == b. Could you post a few lines of example data? Commented Mar 8, 2019 at 2:48
  • Done. I'm basically trying to find overlaps in my begin and end columns across a single note_id instance. Commented Mar 8, 2019 at 3:11
  • Can you post the data as actual text rather than as an image, so that we can paste it into our IDEs? Thanks! Commented Mar 8, 2019 at 3:17
  • Done. Pasting from Excel makes it an image, for some stupid reason. Commented Mar 8, 2019 at 3:25
  • You should probably sample your data, given that what you provided does not match some of the conditions you specify, such as system == 'metamap'. Commented Mar 8, 2019 at 3:27
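
A minimal sketch of what the first comment describes, followed by one way the filter might be written so that every comparison uses columns of the merged frame. In the posted code, the expression c.begin<=b.begin_x compares a column of c with a column of b, and those two frames are indexed differently, which is consistent with the line the traceback points at. The sample rows below are a subset of the posted data. The suffixes _a and _b and the names c, mask, overlaps, and pairs are choices made for this illustration, not part of the original code. Merging on note_id also restricts every pair to the same note, which is slightly stricter than the first branch of the SQL WHERE clause.

```python
import pandas as pd

# Part 1: comparing two Series whose index labels differ raises the error.
s1 = pd.Series([1, 2], index=[0, 1])
s2 = pd.Series([1, 2], index=[0, 2])
try:
    s1 == s2
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Part 2: a subset of the posted rows (only the columns the filter needs).
new = pd.DataFrame(
    {
        "begin": [96, 130, 100, 130, 130, 31],
        "end": [105, 138, 116, 138, 138, 37],
        "note_id": ["527982345"] * 6,
        "system": ["biomedicus", "biomedicus", "metamap",
                   "metamap", "metamap", "ctakes"],
        "id_": [5, 7, 168095, 168099, 168100, 55414],
    }
)

# Self-join within each note. Explicit suffixes (an assumption of this sketch)
# make it clear which side each column came from; "id_" becomes "id__a"/"id__b".
c = new.merge(new, on="note_id", suffixes=("_a", "_b"))

# Same overlap test as the SQL join condition, using only columns of c.
overlap = (
    ((c["begin_b"] >= c["begin_a"]) & (c["begin_b"] <= c["end_a"]))
    | ((c["begin_b"] <= c["begin_a"]) & (c["end_b"] >= c["begin_a"]))
)

# Same logic as the SQL WHERE clause: side a must be metamap, and side b is
# either another system or a different metamap row.
a_is_metamap = c["system_a"] == "metamap"
other_system = c["system_b"] != c["system_a"]
same_system_other_row = (
    (c["system_b"] == c["system_a"]) & (c["id__a"] != c["id__b"])
)

mask = overlap & a_is_metamap & (other_system | same_system_other_row)
overlaps = c.loc[mask]

# (id_ of the metamap row, id_ of the overlapping row)
pairs = sorted(zip(overlaps["id__a"].tolist(), overlaps["id__b"].tolist()))
print(pairs)
```

Because every Series in the mask comes from c, they all share one index and the comparison no longer fails. The merge still builds every pair of rows within a note before filtering, so memory use grows with the square of the rows per note_id.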
