
I have a DataFrame with a column of integers that I would like to combine with a column of string values. Both columns are of object dtype. The problem is that these columns can also contain NaN.

The solutions I have been able to find result in different errors or undesirable outcomes.

My dataframe is like the below:

index  dosagedurationunit  dosagequantityvalue  dosagequantityunit  quantityvalue
0      day                 NaN                  NaN                 NaN
1      day                 NaN                  tablet(s)           NaN
2      day                 2                    NaN                 NaN
3      day                 1                    tablet(s)           NaN
4      day                 2                    tablet(s)           NaN

Code to create the dataframe:

df = pd.DataFrame(
    [["day", None, None, None],
     ["day", None, "tablet(s)", None],
     ["day", 2, None, None],
     ["day", 1, "tablet(s)", None],
     ["day", 2, "tablet(s)", None]],
    columns=["dosagedurationunit", "dosagequantityvalue",
             "dosagequantityunit", "quantityvalue"])

The answer below works on columns of the same type (str): Combine pandas string columns with missing values

  • Converting the columns to str dtype prior to concatenation results in 'nan' strings such as "nan tablet(s)".
  • Using the code below results in a TypeError when there are integers in one of the columns being 'concatenated' (a sketch that works around this follows the list):
df['DOSE'] = df[['dosagequantityvalue', 'dosagequantityunit']].apply(
            lambda x: None if x.isnull().all() else ' '.join(x.dropna()), axis=1)
  • TypeError: sequence item 0: expected str instance, int found
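For reference, a minimal sketch of how that apply can sidestep the TypeError by casting each surviving value to str inside the join (this workaround is my assumption, not one of the attempts above):

# Cast each non-null value to str so ' '.join() accepts ints as well
df['DOSE'] = df[['dosagequantityvalue', 'dosagequantityunit']].apply(
    lambda x: None if x.isnull().all() else ' '.join(str(v) for v in x.dropna()),
    axis=1)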

Desired output dataframe:

index  dosagedurationunit  dosagequantityvalue  dosagequantityunit  quantityvalue  NORMALIZED_DOSE
0      day                 NaN                  NaN                 NaN            NaN
1      day                 NaN                  tablet(s)           NaN            tablet(s)
2      day                 2                    NaN                 NaN            2
3      day                 1                    tablet(s)           NaN            1 tablet(s)
4      day                 2                    tablet(s)           NaN            2 tablet(s)

Realistically, a NORMALIZED_DOSE of NaN or "tablet(s)" provides zero information. I could just drop all rows where dosagequantityvalue is NaN, but I don't know if this will work on a production/non-sample dataset. Besides, I still need a function that handles this operation gracefully.

How can I concatenate two columns (dosagequantityvalue & dosagequantityunit) into a new column (NORMALIZED_DOSE) while handling cases where there may be integers and NaN values in one or both columns?

1 Answer

Looking for an optimized solution, I ended up with a modified version of the approach in tdy's answer and the one here: Combine pandas string columns with missing values

I ended up turning this code into a function since I needed to use it repeatedly. Hope this helps someone else who comes across the same problem:

import numpy as np
import pandas as pd


# functions
def concat_df_cols(df, source_cols, target_col, sep=" ", na_rep=""):
    """Concatenate source_cols into target_col, handling mixed dtypes and NaN.

    Args:
        df (DataFrame): The dataframe to be modified.
        source_cols (list): The columns to concatenate.
        target_col (str): The destination column for the concatenated source columns.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the output after
            concatenation.

    Returns:
        DataFrame: The modified dataframe.
    """
    df = df.copy()
    df[source_cols] = df[source_cols].replace(np.nan, na_rep)  # Replace NaNs with na_rep
    df[source_cols] = df[source_cols].astype(str)  # Convert cols to str to permit concatenation
    df[source_cols] = df[source_cols].replace(r'^\s*$', np.nan, regex=True)  # Put NaNs back
    # Concat source_cols into target_col, skipping NaN values row-wise
    df[target_col] = df[source_cols].apply(
        lambda x: None if x.isnull().all() else sep.join(x.dropna()), axis=1)
    return df
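As a usage sketch on the 5-row sample frame from the question (expected values shown in the comment; all-NaN rows come back as None):

out = concat_df_cols(df, source_cols=['dosagequantityvalue', 'dosagequantityunit'],
                     target_col='NORMALIZED_DOSE')
print(out['NORMALIZED_DOSE'].tolist())
# [None, 'tablet(s)', '2', '1 tablet(s)', '2 tablet(s)']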


def concat_df_cols_fast(df, sep=" ", na_rep=""):
    """Concatenate all columns of df into a single Series, handling mixed dtypes and NaN.

    Args:
        df (DataFrame): The dataframe to concatenate, containing only the source columns.
        sep (str): The separator with which to concatenate the columns.
        na_rep (str): The replacement value for NaN values. Note: anything other
            than the default empty string will persist in the output after
            concatenation.

    Returns:
        Series: The concatenated column.
    """
    # Fill NaNs first, then convert to str (the reverse order would stringify NaN as 'nan')
    df = df.fillna(na_rep).applymap(str)
    # Convert rows to lists, join with sep, and strip stray separators left by empty cells
    arr = df.values.tolist()
    # Map rows that end up empty back to NaN
    s = pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)
    # Replace NaN with None
    s = s.where(s.notnull(), None)
    return s
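And a quick sketch of calling it on the same small frame; note it returns a Series rather than a DataFrame, so you assign it to a column yourself:

s = concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']])
print(s.tolist())
# [None, 'tablet(s)', '2', '1 tablet(s)', '2 tablet(s)']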

# setup
df = pd.DataFrame(
    [['day', np.nan, np.nan, np.nan],
     ['day', np.nan, 'tablet(s)', np.nan],
     ['day', 2, np.nan, np.nan],
     ['day', 1, 'tablet(s)', np.nan],
     ['day', 2, 'tablet(s)', np.nan]],
    columns=['dosagedurationunit', 'dosagequantityvalue',
             'dosagequantityunit', 'quantityvalue'])
# Make the df 50000 rows
df = pd.concat([df]*10000).reset_index(drop=True)

##### Approach 1 #####
# This approach took on average 0.27553908449 seconds
df['NORMALIZED_DOSAGE'] = concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']])

##### Approach 2 #####
# This approach took on average 5.92792463605 seconds
# replace nans with ''
df = df.replace(np.nan, '')
# concat value + unit
df['NORMALIZED_DOSAGE'] = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
# put nans back
df = df.replace(r'^\s*$', np.nan, regex=True)

##### Approach 3 #####
# This approach took on average 27.7539046249 seconds
df = concat_df_cols(df, source_cols=['dosagequantityvalue', 'dosagequantityunit'],
                    target_col='NORMALIZED_DOSAGE')
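The averages above were collected over repeated runs; a minimal sketch of how they could be reproduced with timeit (the harness itself is my assumption, not the original measurement code):

import timeit

# Time approach 1 on the 50,000-row frame; approaches 2 and 3 mutate df,
# so they would need a fresh copy per run
elapsed = timeit.timeit(
    lambda: concat_df_cols_fast(df[['dosagequantityvalue', 'dosagequantityunit']]),
    number=10)
print(f"average: {elapsed / 10:.6f} s per run")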
       

UPDATE: Refactored functions:


def concat_df_cols_new(df, sep=" "):
    """Concatenate all columns of df into a single Series, handling mixed dtypes and NaN.

    Args:
        df (DataFrame): The dataframe to concatenate, containing only the source columns.
        sep (str): The separator with which to concatenate the columns.

    Returns:
        Series: The concatenated column.
    """
    # Replace NaNs with the separator so they vanish when the joined string is stripped
    df = df.replace(np.nan, sep)
    df = df.applymap(str)  # Convert cols to str to permit concatenation
    # Convert rows to lists, join with sep, and strip stray separators
    arr = df.values.tolist()
    # Map rows that end up empty back to NaN
    return pd.Series([sep.join(x).strip(sep) for x in arr]).replace('^$', np.nan, regex=True)

def replace_concat_replace_new(df):
    df = df.replace(np.nan, '')
    s = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
    s = s.replace(r'^\s*$', np.nan, regex=True)
    s = s.replace(r'\s*$', '', regex=True)  # Trim trailing whitespace
    s = s.replace(r'^\s*', '', regex=True)  # Trim leading whitespace
    return s

df['NORMALIZED_DOSAGE_CONCAT'] = concat_df_cols_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 131.98 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['NORMALIZED_DOSAGE'] = replace_concat_replace_new(df[['dosagequantityvalue', 'dosagequantityunit']])
# 395.97 ms ± 28.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Ultimately, I'll go with concat_df_cols_new, simply because I can use this function on dataframes with different column names and the runtime is currently ~3x better. Unless there's a solution for those too..


5 Comments

Hmm interesting, my version is actually faster when I %timeit. I updated my answer with the %timeit results for my code vs concat_df_cols_fast().
That's interesting. I got a similar result to yours when I ran it again just now (different times but replace_concat_replace is ~1.7x as fast). I believe the difference compared to the initial runtime I recorded was due to the code running faster as a function. stackoverflow.com/questions/11241523/…
Did some more refactoring and had them more or less in lockstep on small dfs. I've added the new function as concat_df_cols_new. I also had to add some logic to replace_concat_replace to strip the whitespace. Regex is very costly and results in the runtime for replace_concat_replace_new increasing dramatically on large dataframes. Ultimately, I'll go with concat_df_cols_new simply because I can use this function on dataframes with different column names and the runtime is on currently ~3x better. Unless there's a solution for those too.. Appreciate the help!
Oh nice link about function vs non-function speed. Re: stripping whitespace, I did another refactoring and got replace_concat_replace() to be faster again by replacing those 3 regexes with a single: s = s.str.strip().replace('', np.nan) (sketched after these comments). However, concat_df_cols() still seems like the best way to handle arbitrary columns.
Yeah, didn't know about the speed difference of code in functions! Nice update, I thought there might be a better solution to the whitespace issue but agreed, handling arbitrary columns is a nice benefit of the concat_df_cols approach. Wish there was a way to add two generic cols together using col + ' ' + col. Thanks a lot for this iterative back and forth!
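For completeness, a sketch of the whitespace refactor tdy describes in the last two comments, replacing the three regexes in replace_concat_replace_new with a single strip (the function name here is mine):

def replace_concat_replace_strip(df):
    # Fill NaNs with '', concatenate as str, then strip padding and
    # map rows that end up empty back to NaN
    df = df.replace(np.nan, '')
    s = df.dosagequantityvalue.astype(str) + ' ' + df.dosagequantityunit.astype(str)
    return s.str.strip().replace('', np.nan)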
