How can I subclass a Pandas DataFrame?

Question

Subclassing Pandas classes seems a common need, but I could not find references on the subject. (It seems that Pandas developers are still working on it: Easier subclassing #60.)

There are some SO questions on the subject, but I am hoping that someone here can provide a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two general requirements:

calling standard DataFrame methods on instances of MyDF should produce instances of MyDF
calling standard DataFrame methods on instances of MyDF should leave all attributes still attached to the output

(And are there any significant differences for subclassing pandas.Series?)

Code for subclassing pd.DataFrame:

import numpy as np
import pandas as pd

class MyDF(pd.DataFrame):
    # how to subclass pandas DataFrame?
    pass

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))  # <class '__main__.MyDF'>

# Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
# should produce instances of MyDF.
mydf_sub = mydf[['A','C']]
print(type(mydf_sub))  # <class 'pandas.core.frame.DataFrame'>

# Requirement 2: Attributes attached to instances of MyDF, when calling standard
# methods of DataFrame, should still attach to the output.
mydf.myattr = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(hasattr(mydf_cp1, 'myattr'))  # False
print(hasattr(mydf_cp2, 'myattr'))  # False

See here for a nice example: github.com/kjordahl/geopandas. Note that in general, IMHO, there isn't a reason to ever subclass; composition works much better, is more flexible, and offers more benefits.
– Jeff, Commented Mar 3, 2014 at 20:00
I think there are reasons to want to subclass. At the moment it doesn't work, as stated in the linked issue; it has never been a priority (though some work has been done towards it...).
– Andy Hayden, Commented Mar 3, 2014 at 22:06
@Jeff It seems to me that inheritance is a fundamental feature of object-oriented programming, independent of anyone's views about composition vs inheritance. The difficulty of subclassing DataFrame makes using the package significantly less attractive to me and, I guess, to many others, judging from the issue reports on the pandas GitHub page.
– Dave Kielpinski, Commented Aug 10, 2017 at 23:37
@Jeff I also have a nontrivial codebase. I am not in a position to chase down whether the patch has propagated through all the import statements in all the modules.
– Dave Kielpinski, Commented Aug 11, 2017 at 0:20
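
For readers weighing Jeff's suggestion, here is a minimal sketch of what composition could look like. The class name LabeledFrame, its label attribute, and the select method are hypothetical names chosen for illustration; none of them come from pandas or from the comments above.

```python
import pandas as pd


class LabeledFrame:
    """Hypothetical wrapper: owns a DataFrame plus metadata instead of subclassing."""

    def __init__(self, data, label=None):
        self.df = pd.DataFrame(data)
        self.label = label

    def select(self, columns):
        # Wrap the result explicitly so the metadata travels with it.
        return LabeledFrame(self.df[columns], label=self.label)

    def __getattr__(self, name):
        # Only called when normal lookup fails; forward to the wrapped frame.
        if name == "df":  # avoid infinite recursion if df is not set yet
            raise AttributeError(name)
        return getattr(self.df, name)


lf = LabeledFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]}, label="trial-1")
sub = lf.select(["A", "C"])
print(type(sub).__name__, sub.label, lf.shape)
```

The trade-off is visible in the sketch: metadata handling is explicit and under your control, but every operation that should return the wrapper has to be written by hand, and forwarded DataFrame methods return plain pandas objects.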

wjandrea · Accepted Answer · 2024-09-16 15:23:21Z

49

There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.

The guide mentions this subclassed DataFrame from the Geopandas project as a good example.

As in HYRY's answer, it seems there are two things you're trying to accomplish:

When calling methods on an instance of your class, return instances of the correct type (your type). For this, you can just add the _constructor property which should return your type.
Adding attributes which will be attached to copies of your object. To do this, you need to store the names of these attributes in a list, as the special _metadata attribute.

Here's an example:

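from pandas import DataFrame

class SubclassedDataFrame(DataFrame):
    # Attribute names listed here are carried over to copies
    _metadata = ['added_property']
    added_property = 1  # This will be passed to copies

    @property
    def _constructor(self):
        # Tells pandas which class to build for results of DataFrame methods
        return SubclassedDataFrame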
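
A quick way to check both points is sketched below. It repeats the class definition so that it runs on its own; the column names and values are made up for illustration, and it assumes a reasonably recent pandas.

```python
import pandas as pd


class SubclassedDataFrame(pd.DataFrame):
    _metadata = ["added_property"]
    added_property = 1  # class-level default

    @property
    def _constructor(self):
        return SubclassedDataFrame


df = SubclassedDataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

subset = df[["A", "C"]]  # column selection goes through _constructor
copied = df.copy()       # so does copy()

print(type(subset).__name__)
print(type(copied).__name__)
print(copied.added_property)
```

Both results should report the subclass rather than pandas.core.frame.DataFrame, which is exactly what Requirement 1 in the question asks for.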

edited Sep 16, 2024 at 15:23

wjandrea

answered Feb 25, 2016 at 6:22

cjrieds

2 Comments

pauljohn32 Over a year ago

It is ambiguous whether _metadata refers to class variables or instance variables. This example has a class var. Can somebody clarify about self.?? vars?

pauljohn32 Over a year ago

The __finalize__ method solves Requirement 2 when objects are merged or concatenated. I figured it out by imitating the GeoPandas code; just search for __finalize__ there and the fix is pretty clear to see.
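
On the class-versus-instance question, a small experiment may help. The sketch below assumes that _metadata only lists attribute names and that pandas copies whatever value it finds under each name on the source object, so plain instance attributes work. UnitsFrame and units are hypothetical names, and merge/concat are deliberately not covered here.

```python
import pandas as pd


class UnitsFrame(pd.DataFrame):
    # Only the attribute *name* is declared; no class-level value is needed.
    _metadata = ["units"]

    @property
    def _constructor(self):
        return UnitsFrame


a = UnitsFrame({"x": [1.0, 2.0]})
a.units = "m"             # stored on this instance only

b = a.copy()              # copy() propagates the names listed in _metadata
copied_value = b.units    # value carried over from a
b.units = "cm"            # changing the copy leaves the original alone

c = UnitsFrame({"x": [3.0]})  # never assigned, so it has no units attribute

print(a.units, copied_value, b.units, hasattr(c, "units"))
```

So each instance carries its own value, and an instance that never had the attribute set simply does not have it.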

Peter Mortensen · Accepted Answer · 2021-07-14 09:21:05Z

18

For Requirement 1, just define _constructor:

import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF


mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))  # <class '__main__.MyDF'>

mydf_sub = mydf[['A','C']]
print(type(mydf_sub))  # <class '__main__.MyDF'>

I think there is no simple solution for Requirement 2. I think you need to define __init__ and copy, or do something in _constructor, for example:

import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    _attributes_ = "myattr1,myattr2"

    def __init__(self, *args, **kw):
        super(MyDF, self).__init__(*args, **kw)
        if len(args) == 1 and isinstance(args[0], MyDF):
            args[0]._copy_attrs(self)

    def _copy_attrs(self, df):
        for attr in self._attributes_.split(","):
            df.__dict__[attr] = getattr(self, attr, None)

    @property
    def _constructor(self):
        def f(*args, **kw):
            df = MyDF(*args, **kw)
            self._copy_attrs(df)
            return df
        return f

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))

mydf_sub = mydf[['A','C']]
print(type(mydf_sub))

mydf.myattr1 = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(mydf_cp1.myattr1, mydf_cp2.myattr1)

edited Jul 14, 2021 at 9:21

Peter Mortensen

answered Mar 4, 2014 at 1:22

HYRY

1 Comment

Andy Hayden Over a year ago

It seems to me that you'd often want to have a corresponding subclass of Series at the same time (i.e. have MyDF and MyS linked in some way so that e.g. mydf.sum() returns a MyS...)
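
One way to link the two, sketched here on the assumption that the _constructor_sliced and _constructor_expanddim hooks described in the pandas subclassing guide are available in your version. MyS is the hypothetical Series subclass from the comment above.

```python
import pandas as pd


class MyS(pd.Series):
    @property
    def _constructor(self):
        return MyS

    @property
    def _constructor_expanddim(self):
        # Used when a Series is turned back into a frame, e.g. to_frame()
        return MyDF


class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF

    @property
    def _constructor_sliced(self):
        # Used when a frame produces a Series, e.g. selecting one column
        return MyS


mydf = MyDF({"A": [1, 2, 3], "B": [4, 5, 6]})
column = mydf["A"]
frame_again = column.to_frame()

print(type(column).__name__, type(frame_again).__name__)
# Reductions also produce a Series; print the type to check your version.
print(type(mydf.sum()).__name__)
```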

FabienPv · Accepted Answer · 2024-09-18 14:31:12Z

I ran into a similar problem. My solution may be incomplete, as I have not tested every method of pandas DataFrame to verify how it behaves with my subclass. I am writing it down here in case it is useful to someone.

Specify pd.DataFrame as the parent class
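Override the methods from pd.DataFrame so that they return a new instance of my subclass instead of an instance of pd.DataFrame.
Override __getitem__ in the class body, in the same manner as the other methods (it does not work when it is overwritten on the instance like the others, presumably because Python looks up special methods on the class rather than on the instance).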
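
The __getitem__ behaviour in the last step can be reproduced without pandas. The sketch below uses a hypothetical Box class to show that square-bracket indexing ignores a __getitem__ stored in the instance's __dict__, while an explicit attribute call does find it.

```python
class Box:
    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, key):
        return self.items[key]


box = Box([1, 2, 3])

# Patch the instance, the same way overriding() fills self.__dict__.
box.__dict__["__getitem__"] = lambda key: "patched"

via_brackets = box[0]             # implicit lookup goes to type(box)
via_attribute = box.__getitem__(0)  # explicit lookup sees the instance dict

print(via_brackets, via_attribute)
```

This is why the answer defines __getitem__ directly on the class instead of adding it in overriding().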

import pandas as pd

class MyDF(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __make_func(self, attrib):
        def func(*args, **kwargs):
            result = getattr(super(MyDF, self), attrib)(*args, **kwargs)
            if isinstance(result, pd.DataFrame):
                return set_mydf(result)
            return result
        return func

    def overriding(self):
        for attrib in [func for func in dir(pd.DataFrame)]:
            if attrib not in ["__getitem__"]:
                if callable(getattr(pd.DataFrame, attrib)):
                    self.__dict__[attrib] = self.__make_func(attrib)

    def __getitem__(self, key):
        result = getattr(super(MyDF, self), "__getitem__")(key)
        if isinstance(result, pd.DataFrame):
            return set_mydf(result)
        return result

    def operation(self):
        mydf = self
        print("-1-", type(mydf))
        mydf = mydf[["a", "b"]].apply(lambda x: x*10)
        print("-2-", type(mydf))
        return set_mydf(mydf)

# The return annotation must name the class MyDF (there is no name "mydf" here)
def set_mydf(*args, **kwargs) -> MyDF:
    df = MyDF(*args, **kwargs)
    df.overriding()
    return df

My tests:

mydf = MyDF(data={"a":[0,1,2,3], "b":[4,5,6,7], "c":[8,9,10,11], "d":[12,13,14,15]})
print(mydf)
print(type(mydf))

   a  b   c   d
0  0  4   8  12
1  1  5   9  13
2  2  6  10  14
3  3  7  11  15
<class '__main__.MyDF'>

mydf.overriding()

# If we apply a method from pd.DataFrame, the returned result is an instance of the subclass, as wanted.
mydf = mydf.apply(lambda x: x*10)
print(mydf)
print(type(mydf))

    a   b    c    d
0   0  40   80  120
1  10  50   90  130
2  20  60  100  140
3  30  70  110  150
<class '__main__.MyDF'>

# Correct result when we change a value in a cell
mydf.loc[0, "a"] = 99
print(mydf)
print(type(mydf))

    a   b    c    d
0  99  40   80  120
1  10  50   90  130
2  20  60  100  140
3  30  70  110  150
<class '__main__.MyDF'>

# Correct when we add a column
mydf["e"] = [0]*mydf.shape[0]
print(mydf)
print(type(mydf))

    a   b    c    d  e
0  99  40   80  120  0
1  10  50   90  130  0
2  20  60  100  140  0
3  30  70  110  150  0
<class '__main__.MyDF'>

# Correct with a custom function inside the class
mydf = mydf.operation()
print(mydf)
print(type(mydf))

-1- <class '__main__.MyDF'>
-2- <class 'pandas.core.frame.DataFrame'>
     a    b
0  990  400
1  100  500
2  200  600
3  300  700
<class '__main__.MyDF'>

# This returns a pd.Series, as expected, because a single column is selected
mydf = mydf["a"].apply(lambda x: x/8)
print(mydf)
print(type(mydf))

0    123.75
1     12.50
2     25.00
3     37.50
Name: a, dtype: float64
<class 'pandas.core.series.Series'>

mydf = MyDF(data={"a":[0,1,2,3], "b":[4,5,6,7], "c":[8,9,10,11], "d":[12,13,14,15]})

# Correct thanks to the custom __getitem__
mydf = mydf[["a"]]
print(mydf)
print(type(mydf))

   a
0  0
1  1
2  2
3  3
<class '__main__.MyDF'>
