
I have a very large knowledge graph in a pandas DataFrame, as follows.

This DataFrame KG has more than 100 million rows:

                   pred     subj      obj
        0   nationality     BART      USA
        1  placeOfBirth     BART  NEWYORK
        2     locatedIn  NEWYORK      USA
      ...           ...      ...      ...
116390740     hasFather     BART   HOMMER
116390741   nationality   HOMMER      USA
116390743  placeOfBirth   HOMMER  NEWYORK

I tried to get rows from this KG with specific values for subj and obj.

a) I tried indexing into KG by generating a boolean series with the isin() function:

KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]

b) I also tried indexing the KG using the query() function:

KG = KG.set_index(['subj','obj'], drop=True)
KG = KG.sort_index()
subj_substitution = ['BART', 'NEWYORK']
obj_substitution = ['USA', 'HOMMER']
KG.query(f"subj in {subj_substitution} & obj in {obj_substitution}")

c) I also tried joining two DataFrames using merge() as shown below.

subj_df

      subj
0     BART
1  NEWYORK


obj_df

      obj
0     USA
1  HOMMER

merge_result = pd.merge(KG, subj_df, on=['subj']).drop_duplicates()
merge_result = pd.merge(merge_result, obj_df, on=['obj']).drop_duplicates()

These methods result in the following:

                   pred     subj      obj
        0   nationality     BART      USA
        2     locatedIn  NEWYORK      USA
116390740     hasFather     BART   HOMMER

I used timeit to check the time for each, as shown below.

timeit.timeit(lambda: KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])], number=10)

The runtimes were:

function   runtime
isin()      35.6 s
query()    155.2 s
merge()    288.9 s

Of these, isin() seems to be the fastest way to index a very large DataFrame. I would appreciate it if you could tell me a faster way than this.

  • pred, subj, obj are all strings with low cardinality. Convert them to pd.Categorical; they'll then be represented as integers under the hood. If you can post, say, 1K rows of your dataset as an attachment, I'll post the code. Commented May 10, 2021 at 4:51
  • Actually just tell us each column's cardinality: KG.apply(pd.Series.nunique, axis=0) Commented May 10, 2021 at 5:11
  • Won chul Shin: but that's just the first 6 lines replicated many times. It's not going to exercise the cardinality (small number of unique values). Commented May 10, 2021 at 8:02
  • Sorry I'll give you some new data, so you can try this. Thank you for your help. drive.google.com/file/d/1rNjvvUJxM4LCn9qnWdyOhlWtwK--A577/… Commented May 10, 2021 at 10:39
  • Can you please tell us each column's cardinality, as asked: KG.apply(pd.Series.nunique, axis=0)? Commented May 11, 2021 at 0:56
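The categorical conversion suggested in the comments can be sketched as follows. This is a minimal sketch using a tiny hypothetical sample in place of the real 100M-row KG; isin() works unchanged on categorical columns, while comparisons happen on small integer codes instead of strings:

```python
import pandas as pd
from io import StringIO

# Tiny sample standing in for the 100M-row KG
d = """pred,subj,obj
nationality,BART,USA
placeOfBirth,BART,NEWYORK
locatedIn,NEWYORK,USA
hasFather,BART,HOMMER
nationality,HOMMER,USA
placeOfBirth,HOMMER,NEWYORK"""
KG = pd.read_csv(StringIO(d))

# Low-cardinality strings -> categorical: values are stored as integer codes
for col in ['pred', 'subj', 'obj']:
    KG[col] = KG[col].astype('category')

# The same isin() filter as in (a), now on categorical columns
mask = KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])
result = KG[mask]
```

How much this helps depends on the actual cardinality of each column, which is why the commenters asked for KG.apply(pd.Series.nunique, axis=0).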

1 Answer


I would personally go with isin or query with in.

Pandas doc says:

Performance of query()

DataFrame.query() using numexpr is slightly faster than Python for large frames. Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows.

Details about query can be found here
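To see the engine difference the docs describe, query() can be forced onto the plain-Python engine for comparison. A minimal sketch on synthetic data (the frame and column names a/b are made up here, and numexpr is only used by default when it is installed):

```python
import numpy as np
import pandas as pd

# A frame above the ~200,000-row threshold the docs mention
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.integers(0, 100, 300_000),
                   'b': rng.integers(0, 100, 300_000)})

# Default engine (numexpr when available) vs. explicit Python engine;
# both must return identical rows -- only the evaluation backend differs
r_default = df.query('a > 50 and b < 25')
r_python = df.query('a > 50 and b < 25', engine='python')
```

Wrapping each call in %timeit shows whether numexpr pays off for a given frame size.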

In your example, when I tested a KG DataFrame with shape (50331648, 3) - 50M+ rows and 3 columns - using query and isin, the performance results were almost the same.

isin

%timeit KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]
4.14 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

query with in operator

%timeit KG.query("(subj in ['BART', 'NEWYORK']) and (obj in ['USA', 'HOMMER'])")
4.08 s ± 82.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

query with isin

%timeit KG.query("(subj.isin(['BART', 'NEWYORK'])) & (obj.isin(['USA', 'HOMMER']))")
4.99 s ± 210 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Test Data

from io import StringIO
import pandas as pd

d = """pred,subj,obj
nationality,BART,USA
placeOfBirth,BART,NEWYORK
locatedIn,NEWYORK,USA
hasFather,BART,HOMMER
nationality,HOMMER,USA
placeOfBirth,HOMMER,NEWYORK"""
KG = pd.read_csv(StringIO(d))
# Double the frame 23 times: 6 * 2**23 = 50,331,648 rows
for i in range(23):
    KG = pd.concat([KG, KG])
KG.shape  # (50331648, 3)

If both performance and code readability (maintenance) are a concern, then at least for complex queries I would go with the query function.
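If the same kind of subj/obj lookup is repeated many times, another option worth benchmarking (not tested here at 100M rows) is to pay the set_index/sort_index cost once, as in approach (b), and then select via the sorted MultiIndex with .loc:

```python
import pandas as pd
from io import StringIO

# Same tiny sample as the test data above
d = """pred,subj,obj
nationality,BART,USA
placeOfBirth,BART,NEWYORK
locatedIn,NEWYORK,USA
hasFather,BART,HOMMER
nationality,HOMMER,USA
placeOfBirth,HOMMER,NEWYORK"""
KG = pd.read_csv(StringIO(d))

# Build the sorted MultiIndex once; repeated lookups can then use
# binary search on the sorted index instead of scanning every row
KG_idx = KG.set_index(['subj', 'obj']).sort_index()

# A list of labels per level selects all existing (subj, obj) combinations
result = KG_idx.loc[(['BART', 'NEWYORK'], ['USA', 'HOMMER']), :]
```

One caveat: .loc raises a KeyError if a label is missing from its level entirely, whereas isin and query simply return fewer rows, so this only fits when the lookup values are known to exist.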

