I have a dataframe, df, and I want to split my processing steps into separate functions so that I can reuse those functions in future programs.

    # Drop duplicate rows from the dataframe
    def duplicate_check ():
        global df
        df = df.drop_duplicates(['datetime', 'tagname'])
        df.drop(['tagname'], axis=1, inplace=True)
        return df

    df = duplicate_check()

    # Split the dataframe's array column into individual columns
    def array_split():
        global df
        date = df['datetime']
        df = df['value'] \
            .str.split('\t', expand=True).fillna('0') \
            .replace(r'\s+|\\n', ' ', regex=True) \
            .apply(pd.to_numeric)
        df['datetime'] = date  # Join date back to dataframe
        return df

    df = array_split()

    # Split dataframe df into df and df_spec
    def remove_duplicate_spec():
        global df, df_spec
        df_spec = df.loc[df[123].isin([1])]
        df = df.loc[df[123].isin([0])]
        df_spec = df_spec.drop_duplicates(119)
        return df, df_spec


    df, df_spec = remove_duplicate_spec()

Question: Should I declare global df / df_spec inside each function? Is this best practice, or how can I optimize the code further?

1 Answer

The best way is to pass your dataframe as an argument to each function and return the result, instead of declaring it global.

import pandas as pd

df = pd.DataFrame({'datetime':[0,0,1,1,2], 'tagname':[0,0,1,1,2], 'other':range(95,100)})

def duplicate_check(df):
    # Keep the last row of each (datetime, tagname) pair, then drop tagname
    return df.drop_duplicates(['datetime', 'tagname'], keep='last').drop(['tagname'], axis=1)

duplicate_check(df)

DataFrame:

   datetime  tagname  other
0         0        0     95
1         0        0     96
2         1        1     97
3         1        1     98
4         2        2     99

Result of duplicate_check(df):

   datetime  other
1         0     96
3         1     98
4         2     99
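
The same pattern should carry over to the other two functions from the question. Below is a minimal sketch, not a tested drop-in for your data: the sample frame `raw`, the parameter names `flag_col` and `key_col`, and the variable names `clean`, `wide` and `df_main` are made up for illustration (your real columns are 123 and 119, which are kept as defaults). A function that produces two dataframes simply returns a tuple, and the `date` variable stays local to the function.

```python
import pandas as pd


def duplicate_check(df):
    # Same function as above: takes a frame, returns a new one
    return df.drop_duplicates(["datetime", "tagname"], keep="last").drop(["tagname"], axis=1)


def array_split(df):
    # Expand the tab-separated 'value' column, using the chain from the question
    values = (
        df["value"]
        .str.split("\t", expand=True)
        .fillna("0")
        .replace(r"\s+|\\n", " ", regex=True)
        .apply(pd.to_numeric)
    )
    # assign() returns a new frame, so the caller's dataframe is not modified
    return values.assign(datetime=df["datetime"])


def remove_duplicate_spec(df, flag_col=123, key_col=119):
    # flag_col / key_col are hypothetical parameter names; defaults match the question
    df_spec = df.loc[df[flag_col] == 1].drop_duplicates(subset=[key_col])
    df_main = df.loc[df[flag_col] == 0]
    return df_main, df_spec  # two results come back as a tuple


# Made-up sample data, only to show how the calls chain together
raw = pd.DataFrame({
    "datetime": [0, 0, 1, 2, 3],
    "tagname": ["a", "a", "b", "c", "d"],
    "value": ["1\t5", "1\t5", "0\t6", "1\t5", "0\t7"],
})

clean = duplicate_check(raw)
wide = array_split(clean)
df_main, df_spec = remove_duplicate_spec(wide, flag_col=0, key_col=1)
```

Because every function takes its input as a parameter and returns new objects, no `global` statement is needed and `raw` is left unchanged, which makes the functions easier to reuse in other programs.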

7 Comments

File "<ipython-input-6-6e07979b163d>", line 10, in <cell line: 10> df = duplicate_check() TypeError: duplicate_check() missing 1 required positional argument: 'df'
If I pass df in the definition, def duplicate_check(df):, I get the error above.
I edited my answer, hope this works for you. Use: duplicate_check(df)
Thank you. How would we do this for the third function, which returns two dataframes (df, df_spec)?
Or for the second function, which uses the date variable?