
I have the following dataframe in pandas (the df below is abbreviated):

    Index: 23253 entries, 7.0 to 30559.0
    Data columns (total 17 columns):
    Epoch         23190  non-null values
    follow        23253  non-null values
    T_Opp         245    non-null values
    T_Dir         171    non-null values
    Teacher       0      non-null values
    Activity      23253  non-null values
    Actor         23253  non-null values
    Recipient1    14608  non-null values
    dtypes: float64(10), object(7)

Columns like T_Opp and T_Dir hold dummy (1/0) data. When the values in these columns are true, I want to add data from the 'Actor' column to the 'Teacher' column. So far, I have this (where the "mask" gives the condition under which the data are true; I've checked this bit and it works):

    opp_mask = f_acts['Behavior'].str.contains('bp', na=False)
    opp_teacher = f_acts[opp_mask]['Recipient1']

If I were doing this based only on one column, I could simply plug these results into the Teacher column in the dataframe with something like this:

    df['Teacher'] = df[opp_mask]['Actor']

But I need to fill the Teacher column with data from 6 other columns, without overwriting the earlier columns. I have an idea of how this might work, similar to this toy example:

    list = [1]*len(df.Teacher)
    df['Teacher'] = list

But I can't seem to figure out how to transform the output of the "mask" technique above to the correct format for this approach--it has the same index info but is shorter than the dataframe I need to add it to. What am I missing?

UPDATE: Adding the data below to clarify what I'm trying to do.

    follow   T_Opp   T_Dir   T_Enh   T_SocTol    Teacher    Actor    Recipient1
    7        0       1       0       0           NaN        51608    f
    8        0       0       0       0           NaN        bla      NaN
    11       0       0       0       0           NaN        51601    NaN
    13       1       0       0       1           NaN        f        51602
    18       0       0       0       0           NaN        f        NaN

So for data like these, what I'm trying to do is check the T_ columns one at a time. If the value in a T_ column is true, fetch the data from the Actor column (if looking at the T_Opp or T_SocTol columns) or from the Recipient column (if looking at T_Enh or T_Dir columns). I want to copy that data into the currently empty Teacher column.

More than one of the T_ columns can be true at a time, but in these cases it will always be "grabbing" the same data twice. (In other words, I never need data from BOTH the Actor and Recipient columns. Only one or the other, for each row).


  • Do you mean that you want to create six additional columns where each is a version of Teacher replaced with one of the six columns that you're going to use for replacement? Commented Oct 8, 2013 at 23:06
  • Perhaps a toy example would make this clearer? It's not quite clear what's not working. (As an aside you should use .loc[msk, 'Actor'] rather than chain.) Commented Oct 8, 2013 at 23:22
  • So these 6 columns, are they mutually exclusive? So you would only have a value in 1 of these columns and if so set Teacher to this value? Commented Oct 9, 2013 at 7:18
  • The six columns are separate, but not mutually exclusive. If any of these values is true, I want to pull data from a different column (let's call it Actor), and write it in another different column (Teacher). Through this I want to maintain the data in those six columns. I'm not able to give a data example at the moment but will update with one later. Commented Oct 9, 2013 at 17:05

1 Answer


Here's an approach to masking and concatenating multiple columns with Series.where(). If the end result is a column of strings, numeric columns will need to be converted to string first with .astype(str).

In [23]: df
Out[23]: 
        C0  Mask1  Mask2 Val1 Val2
0  R_l0_g0      0      0   v1   v2
1  R_l0_g1      1      0   v1   v2
2  R_l0_g2      0      1   v1   v2
3  R_l0_g3      1      1   v1   v2

In [24]: df['Other'] = (df.Val1.astype(str).where(df.Mask1.astype(bool), '') + ',' +
                        df.Val2.astype(str).where(df.Mask2.astype(bool), '')).str.strip(',')

In [25]: df
Out[25]: 
        C0  Mask1  Mask2 Val1 Val2  Other
0  R_l0_g0      0      0   v1   v2       
1  R_l0_g1      1      0   v1   v2     v1
2  R_l0_g2      0      1   v1   v2     v2
3  R_l0_g3      1      1   v1   v2  v1,v2

And here's another approach using DataFrame.where(). Like most pandas operations, .where performs automatic data alignment. Since the column names of the data frame and of the frame used as the mask differ in this case, alignment can be disabled by masking with a raw, unlabeled numpy.ndarray (i.e. .values).

In [23]: masked = df[['Val1', 'Val2']].\
                     where(df[['Mask1', 'Mask2']].values.astype(bool), '') + ','

In [24]: df['Other2'] = masked.sum(axis=1).str.strip(',')

In [25]: df
Out[25]: 
        C0  Mask1  Mask2 Val1 Val2  Other Other2
0  R_l0_g0      0      0   v1   v2              
1  R_l0_g1      1      0   v1   v2     v1     v1
2  R_l0_g2      0      1   v1   v2     v2     v2
3  R_l0_g3      1      1   v1   v2  v1,v2  v1,v2
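
For the layout shown in the question's update, where each row needs a value from only one of Actor or Recipient1, a minimal sketch along the same lines might look like the following. The toy frame is rebuilt from the sample rows in the question. The grouping of T_ columns into `actor_cols` and `recipient_cols` follows the question's description, and the names `use_actor` and `use_recipient` are purely illustrative. The `fillna(0)` step is an assumption about the real data: a NaN cast straight to bool becomes True, so missing dummy values are treated as 0 first.

```python
import numpy as np
import pandas as pd

# Toy frame rebuilt from the sample rows in the question.
df = pd.DataFrame({
    "follow": [7, 8, 11, 13, 18],
    "T_Opp": [0, 0, 0, 1, 0],
    "T_Dir": [1, 0, 0, 0, 0],
    "T_Enh": [0, 0, 0, 0, 0],
    "T_SocTol": [0, 0, 0, 1, 0],
    "Actor": ["51608", "bla", "51601", "f", "f"],
    "Recipient1": ["f", np.nan, np.nan, "51602", np.nan],
})

# Which dummy columns point at which source column (per the question's update).
actor_cols = ["T_Opp", "T_SocTol"]
recipient_cols = ["T_Enh", "T_Dir"]

# One boolean mask per source column: True if any of its dummy columns is set.
# fillna(0) first, because NaN would otherwise be cast to True.
use_actor = df[actor_cols].fillna(0).astype(bool).any(axis=1)
use_recipient = df[recipient_cols].fillna(0).astype(bool).any(axis=1)

# Keep Actor where use_actor is True; elsewhere fall back to Recipient1
# (itself kept only where use_recipient is True, NaN otherwise).
df["Teacher"] = df["Actor"].where(use_actor, df["Recipient1"].where(use_recipient))

print(df[["follow", "Teacher"]])
```

Rows where no T_ column is set are left as NaN in Teacher, and the T_ columns themselves are not modified. If a row ever had both an Actor-type and a Recipient-type flag set, this sketch would prefer Actor.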

5 Comments

Hm, the first method here throws a TypeError: unsupported operand type(s) for +: 'float' and 'str' ... do I need to convert the object types?
The second method seems to not allow working with lists that repeat the same column names. By this I mean that for Val1 and Val2, I can't use Recipient1, Recipient1, Actor, Actor, Actor. I think this requirement wasn't totally clear in my first posting of the question. But it could also be that this method DOES work, and I'm just not properly adapting it.
Ah, in the case of non-string columns, you can easily convert to string with .astype(str), e.g. df.Val1.astype(str).where(df.Mask1, '')
Let me know if the second approach still fails with explicit conversion to string (e.g. masked = df[['Val1', 'Val2']].astype(str).where(...))
After some digging, it looks like pandas 0.12 doesn't handle the second approach above with duplicate column names, though 0.11 and master (latest) look OK. Please confirm your pandas version if you're still having issues.
