I would like to wrap some standard tasks for a pandas DataFrame, such as initializing it with data and processing that data, in a class. I am currently performing the following steps:

import pandas as pd
import urllib.request


def __get_data():
    URL = r'https://en.wikipedia.org/wiki/List_of_sovereign_states_' \
          r'and_dependent_territories_by_continent_(data_file)#Data_file'
    HTML_STRING = urllib.request.urlopen(URL)
    return pd.read_html(HTML_STRING)[2]


def __prepare_data(df):
    df.iloc[:,-1] = df.iloc[:,-1].str.upper()
    return df


MyDataFrame = pd.DataFrame()
MyDataFrame = __get_data()
MyDataFrame = __prepare_data(MyDataFrame)

I'd like something like this:

class MyDataFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super(MyDataFrame, self).__init__(*args, **kwargs)
        self = self.__get_data()
        self.__prepare_data()

    def __get_data(self):
        URL = r'https://en.wikipedia.org/wiki/List_of_sovereign_states_' \
              r'and_dependent_territories_by_continent_(data_file)#Data_file'
        HTML_STRING = urllib.request.urlopen(URL)
        return pd.read_html(HTML_STRING)[2]

    def __prepare_data(self):
        self.iloc[:, -1] = self.iloc[:, -1].str.upper()

Unfortunately, I do not understand the pandas documentation on this point.

1 Answer

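While I think this is ill-advised, this modification works. It stores the fetched table in a data attribute, because rebinding self inside __init__ has no effect on the object being constructed: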

class MyDataFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super(MyDataFrame, self).__init__(*args, **kwargs)
        self.data = self.__get_data()
        self.__prepare_data()

    def __get_data(self):
        URL = r'https://en.wikipedia.org/wiki/List_of_sovereign_states_' \
              r'and_dependent_territories_by_continent_(data_file)#Data_file'
        HTML_STRING = urllib.request.urlopen(URL)
        return pd.read_html(HTML_STRING)[2]

    def __prepare_data(self):
        self.data.iloc[:, -1] = self.data.iloc[:, -1].str.upper()

d = MyDataFrame()

print(d.data)

Output:

     CC   a-2  a-3  #      Name
0    AS   AF   AFG  4.0    AFGHANISTAN, ISLAMIC REPUBLIC OF
1    EU   AL   ALB  8.0    ALBANIA, REPUBLIC OF
2    AN   AQ   ATA  10.0   ANTARCTICA (THE TERRITORY SOUTH OF 60 DEG S)
3    AF   DZ   DZA  12.0   ALGERIA, PEOPLES DEMOCRATIC REPUBLIC OF
4    OC   AS   ASM  16.0   AMERICAN SAMOA
...  ...  ...  ...  ...    ...
257  AF   ZM   ZMB  894.0  ZAMBIA, REPUBLIC OF
258  AS   XD   NaN  NaN    UNITED NATIONS NEUTRAL ZONE
259  AS   XE   NaN  NaN    IRAQ-SAUDI ARABIA NEUTRAL ZONE
260  AS   XS   NaN  NaN    SPRATLY ISLANDS
261  OC   XX   NaN  NaN    DISPUTED TERRITORY

4 Comments

Why do you think it is ill-advised? :)
There's no reason at all to subclass the dataframe here (there rarely is). But OP is a newbie, I don't want to be too critical.
@JoshFriedlander I took to heart your advice that subclassing is rarely necessary for DataFrames and implemented an alternative solution using @pd.api.extensions.register_dataframe_accessor() (see the pandas docs). Would this be a better way to add user-specific methods to a DataFrame, like get_data() and prepare_data() from my code example? What would be a good way to implement something like this?
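For readers wondering what the accessor approach mentioned in this comment might look like, here is a minimal sketch. The namespace "prep", the class name PrepAccessor, and the method upper_last_column are all made up for illustration, and the small frame stands in for the fetched table. Only the cleaning step is shown, since fetching arguably does not belong on the DataFrame at all.

```python
import pandas as pd


# "prep" is a hypothetical namespace chosen for this sketch.
@pd.api.extensions.register_dataframe_accessor("prep")
class PrepAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is attached to.
        self._obj = pandas_obj

    def upper_last_column(self):
        """Return a copy of the frame with its last column upper-cased."""
        out = self._obj.copy()
        out.iloc[:, -1] = out.iloc[:, -1].str.upper()
        return out


# Illustrative stand-in data, not the real Wikipedia table.
df = pd.DataFrame({"CC": ["AS", "EU"], "Name": ["Afghanistan", "Albania"]})
cleaned = df.prep.upper_last_column()
```

Because the method works on a copy, df itself is left unchanged and cleaned holds the processed result.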
The way I see it, fetching the data is not a function of the dataframe. Nor is preparing the data. If you need to use an object, just create a DataProvider class that has a data attribute. It fetches and cleans the data and stores the output df as self.data.
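A rough sketch of the DataProvider idea from this comment follows. It reuses the URL and table index from the question. The loader parameter is an assumption added here so the class can be tried without network access, and the helper names _fetch and _prepare are likewise hypothetical.

```python
import urllib.request

import pandas as pd

URL = ('https://en.wikipedia.org/wiki/List_of_sovereign_states_'
       'and_dependent_territories_by_continent_(data_file)#Data_file')


class DataProvider:
    """Fetches a table, cleans it, and stores the result as self.data."""

    def __init__(self, url=URL, table_index=2, loader=None):
        self.url = url
        self.table_index = table_index
        # `loader` is a hypothetical hook: any callable returning a DataFrame.
        # When omitted, the table is downloaded as in the question.
        raw = loader() if loader is not None else self._fetch()
        self.data = self._prepare(raw)

    def _fetch(self):
        # Same approach as the question: parse every table on the page
        # and pick one by position.
        html = urllib.request.urlopen(self.url)
        return pd.read_html(html)[self.table_index]

    @staticmethod
    def _prepare(df):
        # Work on a copy so the caller's frame is not modified.
        df = df.copy()
        df.iloc[:, -1] = df.iloc[:, -1].str.upper()
        return df


# Offline usage with illustrative stand-in data:
sample = pd.DataFrame({"CC": ["AS", "EU"], "Name": ["Afghanistan", "Albania"]})
provider = DataProvider(loader=lambda: sample)
```

Calling DataProvider() with no arguments would attempt the download instead. The DataFrame stays a plain pd.DataFrame, so no subclassing is involved.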
