
I have a DataFrame with a column of integers that I would like to combine with a column of string values. Both columns are of object dtype. The problem is that these columns can also contain NaN.

The solutions I have been able to find result in different errors or undesirable outcomes.

My dataframe is like the below:

index  dosagedurationunit  dosagequantityvalue  dosagequantityunit  quantityvalue
0      day                 NaN                  NaN                 NaN
1      day                 NaN                  tablet(s)           NaN
2      day                 2                    NaN                 NaN
3      day                 1                    tablet(s)           NaN
4      day                 2                    tablet(s)           NaN

Code to create the dataframe:

df = pd.DataFrame(
    [["day", None, None, None],
     ["day", None, "tablet(s)", None],
     ["day", 2, None, None],
     ["day", 1, "tablet(s)", None],
     ["day", 2, "tablet(s)", None]],
    columns=["dosagedurationunit", "dosagequantityvalue",
             "dosagequantityunit", "quantityvalue"])

The answer below works on columns of the same type (str): Combine pandas string columns with missing values

  • Converting the columns to str dtype prior to concatenation results in 'nan' strings such as "nan tablet(s)".
  • Using the code below results in a TypeError when there are integers in one of the columns being 'concatenated' (a sketch that works around this follows the list):
df['DOSE'] = df[['dosagequantityvalue', 'dosagequantityunit']].apply(
            lambda x: None if x.isnull().all() else ' '.join(x.dropna()), axis=1)
  • TypeError: sequence item 0: expected str instance, int found
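For reference, a minimal sketch of how that apply can sidestep the TypeError by casting each surviving value to str inside the join (this workaround is my assumption, not one of the attempts above):

# Cast each non-null value to str so ' '.join() accepts ints as well
df['DOSE'] = df[['dosagequantityvalue', 'dosagequantityunit']].apply(
    lambda x: None if x.isnull().all() else ' '.join(str(v) for v in x.dropna()),
    axis=1)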

Desired output dataframe:

index  dosagedurationunit  dosagequantityvalue  dosagequantityunit  quantityvalue  NORMALIZED_DOSE
0      day                 NaN                  NaN                 NaN            NaN
1      day                 NaN                  tablet(s)           NaN            tablet(s)
2      day                 2                    NaN                 NaN            2
3      day                 1                    tablet(s)           NaN            1 tablet(s)
4      day                 2                    tablet(s)           NaN            2 tablet(s)

Realistically, a NORMALIZED_DOSE of NaN or "tablet(s)" provides zero information. I could just drop all rows where dosagequantityvalue is NaN, but I don't know if this will work on a production/non-sample dataset. Besides, I still need a function that handles this operation gracefully.

How can I concatenate two columns (dosagequantityvalue & dosagequantityunit) into a new column (NORMALIZED_DOSE) while handling cases where there may be integers and NaN values in one or both columns?

1 Answer

Looking for an optimized solution, I ended up with a modified version of the approach in tdy's answer and the one here: Combine pandas string columns with missing values

I ended up turning this code into a function since I needed to use it repeatedly. Hope this helps someone else who comes across the same problem:

import numpy as np
import pandas as pd


# functions
def concat_df_cols(df, source_cols, target_col, sep=" ", na_rep=""):
    """Concatenate source_cols into target_col, handling mixed dtypes and NaN.

    Args:
        df (DataFrame): The dataframe to be modified.
        source_cols (list): The columns to concatenate.
        target_col (str): The destination column for the concatenated source columns.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the output after
            concatenation.

    Returns:
        DataFrame: The modified dataframe.
    """
    df = df.copy()
    df[source_cols] = df[source_cols].replace(np.nan, na_rep)  # Replace NaNs with na_rep
    df[source_cols] = df[source_cols].astype(str)  # Convert cols to str to permit concatenation
    df[source_cols] = df[source_cols].replace(r'^\s*$', np.nan, regex=True)  # Put NaNs back
    # Concat source_cols into target_col, skipping NaN values row-wise
    df[target_col] = df[source_cols].apply(
        lambda x: None if x.isnull().all() else sep.join(x.dropna()), axis=1)
    return df
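As a usage sketch on the 5-row sample frame from the question (expected values shown in the comment; all-NaN rows come back as None):

out = concat_df_cols(df, source_cols=['dosagequantityvalue', 'dosagequantityunit'],
                     target_col='NORMALIZED_DOSE')
print(out['NORMALIZED_DOSE'].tolist())
# [None, 'tablet(s)', '2', '1 tablet(s)', '2 tablet(s)']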


def concat_df_cols_fast(df, sep=" ", na_rep=""):
    """Concatenate all columns of df into a single Series, handling mixed dtypes and NaN.

    Args:
        df (DataFrame): The dataframe to concatenate, containing only the source columns.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the output after
            concatenation.

    Returns:
        Series: The concatenated column.
    """
    # Fill NaNs first, then convert to str (the reverse order would stringify NaN as 'nan')
    df = df.fillna(na_rep).applymap(str)
    # Convert rows to lists, join with sep, and strip stray separators left by empty cells
    arr = df.values.tolist()
    # Map rows that end up empty back to NaN
    s = pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)
    # Replace NaN with None
    s = s.where(s.notnull(), None)
    return s
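And a quick sketch of calling it on the same small frame; note it returns a Series rather than a DataFrame, so you assign it to a column yourself:

s = concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']])
print(s.tolist())
# [None, 'tablet(s)', '2', '1 tablet(s)', '2 tablet(s)']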

# setup
df = pd.DataFrame(
    [['day', np.nan, np.nan, np.nan],
     ['day', np.nan, 'tablet(s)', np.nan],
     ['day', 2, np.nan, np.nan],
     ['day', 1, 'tablet(s)', np.nan],
     ['day', 2, 'tablet(s)', np.nan]],
    columns=['dosagedurationunit', 'dosagequantityvalue',
             'dosagequantityunit', 'quantityvalue'])
# Make the df 50000 rows
df = pd.concat([df]*10000).reset_index(drop=True)

##### Approach 1 #####
# This approach took on average 0.27553908449 seconds
df['NORMALIZED_DOSAGE'] = concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']])

##### Approach 2 #####
# This approach took on average 5.92792463605 seconds
# replace nans with ''
df = df.replace(np.nan, '')
# concat value + unit
df['NORMALIZED_DOSAGE'] = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
# put nans back
df = df.replace(r'^\s*$', np.nan, regex=True)

##### Approach 3 #####
# This approach took on average 27.7539046249 seconds
df = concat_df_cols(df, source_cols=['dosagequantityvalue', 'dosagequantityunit'],
                    target_col='NORMALIZED_DOSAGE')
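The averages above were collected over repeated runs; a minimal sketch of how they could be reproduced with timeit (the harness itself is my assumption, not the original measurement code):

import timeit

# Time approach 1 on the 50,000-row frame; approaches 2 and 3 mutate df,
# so they would need a fresh copy per run
elapsed = timeit.timeit(
    lambda: concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']]),
    number=10)
print(f"average: {elapsed / 10:.6f} s per run")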
       

UPDATE: Refactored functions:


def concat_df_cols_new(df, sep=" "):
    """Concatenate all columns of df into a single Series, handling mixed dtypes and NaN.

    Args:
        df (DataFrame): The dataframe to concatenate, containing only the source columns.
        sep (str): The separator with which to concatenate the columns.

    Returns:
        Series: The concatenated column.
    """
    # Replace NaNs with the separator so they vanish when the joined string is stripped
    df = df.replace(np.nan, sep)
    df = df.applymap(str)  # Convert cols to str to permit concatenation
    # Convert rows to lists, join with sep, and strip stray separators
    arr = df.values.tolist()
    # Map rows that end up empty back to NaN
    return pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)

def replace_concat_replace_new(df):
    df = df.replace(np.nan, '')
    s = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
    s = s.replace(r'^\s*$', np.nan, regex=True)
    s = s.replace(r'\s*$', '', regex=True)  # Trim trailing whitespace
    s = s.replace(r'^\s*', '', regex=True)  # Trim leading whitespace
    return s

df['NORMALIZED_DOSAGE_CONCAT'] = concat_df_cols_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 131.98 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['NORMALIZED_DOSAGE'] = replace_concat_replace_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 395.97 ms ± 28.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Ultimately, I'll go with concat_df_cols_new, simply because I can use this function on dataframes with different column names and the runtime is currently ~3x better. Unless there's a solution for those too..


5 Comments

Hmm interesting, my version is actually faster when I %timeit. I updated my answer with the %timeit results for my code vs concat_df_cols_fast().
That's interesting. I got a similar result to yours when I ran it again just now (different times but replace_concat_replace is ~1.7x as fast). I believe the difference compared to the initial runtime I recorded was due to the code running faster as a function. stackoverflow.com/questions/11241523/…
Did some more refactoring and had them more or less in lockstep on small dfs. I've added the new function as concat_df_cols_new. I also had to add some logic to replace_concat_replace to strip the whitespace. Regex is very costly and results in the runtime for replace_concat_replace_new increasing dramatically on large dataframes. Ultimately, I'll go with concat_df_cols_new simply because I can use this function on dataframes with different column names and the runtime is on currently ~3x better. Unless there's a solution for those too.. Appreciate the help!
Oh nice link about function vs non-function speed. Re: stripping whitespace, I did another refactoring and got replace_concat_replace() to be faster again by replacing those 3 regexes with a single: s = s.str.strip().replace('', np.nan) (sketched after these comments). However, concat_df_cols() still seems like the best way to handle arbitrary columns.
Yeah, didn't know about the speed difference of code in functions! Nice update, I thought there might be a better solution to the whitespace issue but agreed, handling arbitrary columns is a nice benefit of the concat_df_cols approach. Wish there was a way to add two generic cols together using col + ' ' + col. Thanks a lot for this iterative back and forth!
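For completeness, a sketch of the whitespace refactor tdy describes in the last two comments, replacing the three regexes in replace_concat_replace_new with a single strip (the function name here is mine):

def replace_concat_replace_strip(df):
    # Fill NaNs with '', concatenate as str, then strip padding and
    # map rows that end up empty back to NaN
    df = df.replace(np.nan, '')
    s = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
    return s.str.strip().replace('', np.nan)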
