
I have a very large knowledge graph in a pandas DataFrame, as follows.

This DataFrame KG has more than 100 million rows:

                   pred     subj      obj
        0   nationality     BART      USA
        1  placeOfBirth     BART  NEWYORK
        2     locatedIn  NEWYORK      USA
      ...           ...      ...      ...
116390740     hasFather     BART   HOMMER
116390741   nationality   HOMMER      USA
116390743  placeOfBirth   HOMMER  NEWYORK

I tried to get rows from this KG with specific values for subj and obj.

a) I tried indexing into KG by generating a boolean series with the isin() function:

KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]

b) I also tried indexing the KG using the query() function:

KG = KG.set_index(['subj','obj'], drop=True)
KG = KG.sort_index()
subj_substitution = ['BART', 'NEWYORK']
obj_substitution = ['USA', 'HOMMER']
KG.query(f"subj in {subj_substitution} & obj in {obj_substitution}")

c) I also tried joining two DataFrames using merge() as shown below.

subj_df

      subj
0     BART
1  NEWYORK


obj_df

      obj
0     USA
1  HOMMER

merge_result = pd.merge(KG, subj_df, on=['subj']).drop_duplicates()
merge_result = pd.merge(merge_result, obj_df, on=['obj']).drop_duplicates()

These methods result in the following:

                   pred     subj      obj
        0   nationality     BART      USA
        2     locatedIn  NEWYORK      USA
116390740     hasFather     BART   HOMMER

I used timeit to check the time for each, as shown below.

timeit.timeit(lambda: KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])], number=10)

The runtimes were:

function   runtime
isin()      35.6 s
query()    155.2 s
merge()    288.9 s

Of these, isin() seems to be the fastest way to index a very large DataFrame. I would appreciate it if you could tell me a faster way than this.

  • pred, subj, obj are all strings with low cardinality. Convert them to pd.Categorical; they'll then be represented as integers under the hood. If you can post, say, 1K rows of your dataset as an attachment, I'll post the code. Commented May 10, 2021 at 4:51
  • Actually just tell us each column's cardinality: KG.apply(pd.Series.nunique, axis=0) Commented May 10, 2021 at 5:11
  • Won chul Shin: but that's just the first 6 lines replicated many times. It's not going to exercise the cardinality (small number of unique values). Commented May 10, 2021 at 8:02
  • Sorry I'll give you some new data, so you can try this. Thank you for your help. drive.google.com/file/d/1rNjvvUJxM4LCn9qnWdyOhlWtwK--A577/… Commented May 10, 2021 at 10:39
  • Can you please tell us each column's cardinality, as asked: KG.apply(pd.Series.nunique, axis=0)? Commented May 11, 2021 at 0:56
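The categorical conversion suggested in the comments can be sketched as follows. This is a minimal sketch using a tiny hypothetical sample in place of the real 100M-row KG; isin() works unchanged on categorical columns, while comparisons happen on small integer codes instead of strings:

```python
import pandas as pd
from io import StringIO

# Tiny sample standing in for the 100M-row KG
d = """pred,subj,obj
nationality,BART,USA
placeOfBirth,BART,NEWYORK
locatedIn,NEWYORK,USA
hasFather,BART,HOMMER
nationality,HOMMER,USA
placeOfBirth,HOMMER,NEWYORK"""
KG = pd.read_csv(StringIO(d))

# Low-cardinality strings -> categorical: values are stored as integer codes
for col in ['pred', 'subj', 'obj']:
    KG[col] = KG[col].astype('category')

# The same isin() filter as in (a), now on categorical columns
mask = KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])
result = KG[mask]
```

How much this helps depends on the actual cardinality of each column, which is why the commenters asked for KG.apply(pd.Series.nunique, axis=0).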

1 Answer


I would personally go with isin or query with in.

Pandas doc says:

Performance of query()

DataFrame.query() using numexpr is slightly faster than Python for large frames. Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows.

Details about query can be found here
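To see the engine difference the docs describe, query() can be forced onto the plain-Python engine for comparison. A minimal sketch on synthetic data (the frame and column names a/b are made up here, and numexpr is only used by default when it is installed):

```python
import numpy as np
import pandas as pd

# A frame above the ~200,000-row threshold the docs mention
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.integers(0, 100, 300_000),
                   'b': rng.integers(0, 100, 300_000)})

# Default engine (numexpr when available) vs. explicit Python engine;
# both must return identical rows -- only the evaluation backend differs
r_default = df.query('a > 50 and b < 25')
r_python = df.query('a > 50 and b < 25', engine='python')
```

Wrapping each call in %timeit shows whether numexpr pays off for a given frame size.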

In your example, when I tested a KG DataFrame with shape (50331648, 3) - 50M+ rows and 3 columns - using query and isin, the performance results were almost the same.

isin

%timeit KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]
4.14 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

query with in operator

%timeit KG.query("(subj in ['BART', 'NEWYORK']) and (obj in ['USA', 'HOMMER'])")
4.08 s ± 82.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

query with isin

%timeit KG.query("(subj.isin(['BART', 'NEWYORK'])) & (obj.isin(['USA', 'HOMMER']))")
4.99 s ± 210 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Test Data

from io import StringIO
import pandas as pd

d = """pred,subj,obj
nationality,BART,USA
placeOfBirth,BART,NEWYORK
locatedIn,NEWYORK,USA
hasFather,BART,HOMMER
nationality,HOMMER,USA
placeOfBirth,HOMMER,NEWYORK"""
KG = pd.read_csv(StringIO(d))
# Double the frame 23 times: 6 * 2**23 = 50,331,648 rows
for i in range(23):
    KG = pd.concat([KG, KG])
KG.shape  # (50331648, 3)

If both performance and code readability (maintenance) are a concern, then at least for complex queries I would go with the query function.
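If the same kind of subj/obj lookup is repeated many times, another option worth benchmarking (not tested here at 100M rows) is to pay the set_index/sort_index cost once, as in approach (b), and then select via the sorted MultiIndex with .loc:

```python
import pandas as pd
from io import StringIO

# Same tiny sample as the test data above
d = """pred,subj,obj
nationality,BART,USA
placeOfBirth,BART,NEWYORK
locatedIn,NEWYORK,USA
hasFather,BART,HOMMER
nationality,HOMMER,USA
placeOfBirth,HOMMER,NEWYORK"""
KG = pd.read_csv(StringIO(d))

# Build the sorted MultiIndex once; repeated lookups can then use
# binary search on the sorted index instead of scanning every row
KG_idx = KG.set_index(['subj', 'obj']).sort_index()

# A list of labels per level selects all existing (subj, obj) combinations
result = KG_idx.loc[(['BART', 'NEWYORK'], ['USA', 'HOMMER']), :]
```

One caveat: .loc raises a KeyError if a label is missing from its level entirely, whereas isin and query simply return fewer rows, so this only fits when the lookup values are known to exist.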

