Status: Closed
Labels: Interval (Interval data type), Performance (Memory or execution speed performance)
Description
When using cut with an IntervalIndex for bins, the result is first materialized as an IntervalIndex and then converted to a Categorical:
pandas/pandas/core/reshape/tile.py, lines 373 to 378 at 143bc34:

    if isinstance(bins, IntervalIndex):
        # we have a fast-path here
        ids = bins.get_indexer(x)
        result = algos.take_nd(bins, ids)
        result = Categorical(result, categories=bins, ordered=True)
        return result, bins
It would likely be more performant, both computationally and in memory usage, to bypass the intermediate IntervalIndex constructed via take_nd and instead construct the Categorical directly with Categorical.from_codes.
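A rough sketch of the idea using only the public API (not the actual patch): get_indexer already returns the integer codes, with -1 where a value falls in no bin, which is exactly what Categorical.from_codes expects, so the IntervalIndex built by take_nd can be skipped entirely.

```python
import numpy as np
import pandas as pd

# Same inputs as the measurements below.
ii = pd.interval_range(0, 20)
values = np.linspace(0, 20, 100).repeat(10**4)

# get_indexer yields integer bin codes (-1 -> NaN); feed them straight
# into Categorical.from_codes instead of materializing an IntervalIndex.
ids = ii.get_indexer(values)
result = pd.Categorical.from_codes(ids, categories=ii, ordered=True)
```

This produces the same ordered Categorical that cut returns for IntervalIndex bins, without the intermediate take_nd allocation.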
Some ad hoc measurements on master:
In [3]: ii = pd.interval_range(0, 20)
In [4]: values = np.linspace(0, 20, 100).repeat(10**4)
In [5]: %timeit pd.cut(values, ii)
7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %memit pd.cut(values, ii)
peak memory: 278.39 MiB, increment: 130.76 MiB

And the same measurements with the Categorical.from_codes fix:
In [3]: ii = pd.interval_range(0, 20)
In [4]: values = np.linspace(0, 20, 100).repeat(10**4)
In [5]: %timeit pd.cut(values, ii)
1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %memit pd.cut(values, ii)
peak memory: 145.81 MiB, increment: 15.98 MiB
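For reference, the %timeit and %memit magics require IPython (and the memory_profiler extension for %memit); a plain-Python stand-in for the timing measurement, whose absolute numbers will of course vary by machine and pandas version, could look like:

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as the IPython sessions above.
ii = pd.interval_range(0, 20)
values = np.linspace(0, 20, 100).repeat(10**4)

# Best of 3 single runs of pd.cut over the 10**6-element array,
# roughly analogous to what %timeit reports.
best = min(timeit.repeat(lambda: pd.cut(values, ii), repeat=3, number=1))
print(f"pd.cut with IntervalIndex bins: best of 3 runs = {best:.3f} s")
```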