I have a very large knowledge graph in pandas dataframe format as follows.
This dataframe KG has more than 100 million rows:
pred subj obj
0 nationality BART USA
1 placeOfBirth BART NEWYORK
2 locatedIn NEWYORK USA
... ... ... ...
116390740 hasFather BART HOMMER
116390741 nationality HOMMER USA
116390743 placeOfBirth HOMMER NEWYORK
I tried to get a row from this KG with a specific value for subj and obj.
a) I tried indexing into KG with a boolean Series built using the isin() function:
KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]
b) I also tried indexing the KG using the query() function:
KG = KG.set_index(['subj','obj'], drop=True)
KG = KG.sort_index()
subj_substitution = ['BART', 'NEWYORK']
obj_substitution = ['USA', 'HOMMER']
KG.query(f"subj in {subj_substitution} & obj in {obj_substitution}")
c) And I also tried to join two DataFrames using a merge() as shown below.
subj_df
subj
0 BART
1 NEWYORK
obj_df
obj
0 USA
1 HOMMER
merge_result = pd.merge(KG, subj_df, on = ['subj']).drop_duplicates()
merge_result = pd.merge(merge_result, obj_df, on = ['obj']).drop_duplicates()
These methods result in the following:
pred subj obj
0 nationality BART USA
2 locatedIn NEWYORK USA
116390740 hasFather BART HOMMER
I used the timeit function to check the time for each as shown below.
timeit.timeit(lambda: KG[(KG['subj'].isin(['BART', 'NEWYORK']) & (KG['obj'].isin(['USA', 'HOMMER'])))] , number=10)
The runtimes were:
| function | runtime |
|---|---|
| isin() | 35.6s |
| query() | 155.2s |
| merge() | 288.9s |
I think isin() is the fastest way to index a very large DataFrame.
I would appreciate it if you could tell me a faster way than this.
pred, subj, obj are all strings with low cardinality. Convert them to pd.Categorical, then they'll get represented as integers under the hood. If you can post, say, 1K rows of your dataset as an attachment, I'll post the code.
KG.apply(pd.Series.nunique, axis=0)?
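A minimal sketch of the categorical suggestion, assuming the three columns really are low-cardinality strings. It uses a six-row stand-in built from the sample rows in the question, not the real data, and whether it speeds up the 100-million-row filter would still need to be measured with timeit.

```python
import pandas as pd

# Six-row stand-in for the real KG (assumption: same three string columns).
KG = pd.DataFrame(
    {
        "pred": ["nationality", "placeOfBirth", "locatedIn",
                 "hasFather", "nationality", "placeOfBirth"],
        "subj": ["BART", "BART", "NEWYORK", "BART", "HOMMER", "HOMMER"],
        "obj": ["USA", "NEWYORK", "USA", "HOMMER", "USA", "NEWYORK"],
    }
)

# Count distinct values per column to check that cardinality is low.
n_unique = KG.nunique()
print(n_unique)

# Store each column as a categorical: integer codes plus a lookup table.
KG_cat = KG.astype("category")
print(KG_cat.dtypes)

# The same isin() filter as in the question works on categorical columns.
mask = KG_cat["subj"].isin(["BART", "NEWYORK"]) & KG_cat["obj"].isin(["USA", "HOMMER"])
result = KG_cat[mask]
print(result)
```

On the full dataset, comparing KG.memory_usage(deep=True) before and after the conversion shows how much smaller the categorical representation is.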