Selecting matching columns from a data frame based on values in a column of another data frame

Question

I have two data frames. The first one, df1, has 485513 columns and 100 rows:

head(df1)

sample  cg1 cg2 cg3 cg4 cg5 cg6 cg7 cg8 cg9 cg10    cg11
AAD_1   33435   33436   33437   33438   33439   33440   33441   33442   33443   33444   33445
AAD_2   0.33    1.33    2.33    3.33    4.33    5.33    6.33    7.33    8.33    9.33    10.33
AAD_3   0.56    1.56    2.56    3.56    4.56    5.56    6.56    7.56    8.56    9.56    10.56
AAD_4   45.9    46.9    47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9
AAD_5   46.9    47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9
AAD_6   47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9
AAD_7   48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9
AAD_8   49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9
AAD_9   50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9    60.9
AAD_10  51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9    60.9    61.9

The second one, df2, has 84 rows and a single column. I want to get a subset of df1 using the values in that column of df2.

head(df2)
    ID
    cg1
    cg2
    cg3
    cg4
    cg5

The values in df2 are the names of the columns I am interested in from df1, so I tried the following one-liner in R:

> UP=(df1 %>% as.data.frame)[,df2$ID]

This returned a data frame UP with 84 columns and 100 rows, but none of the returned columns match the column names listed in my query data frame df2.

It would be great if someone could suggest an alternative solution.

Why does this question have both tags for Pandas and R? Which solution are you looking for?
– Alexander, Commented Aug 19, 2016 at 21:19

Alexander · Accepted Answer · 2016-08-19 21:18:41Z

2

Assuming df2 is a Series:

>>> df[df2.tolist()]

        cg1       cg2       cg3       cg4       cg5
0  33435.00  33436.00  33437.00  33438.00  33439.00
1      0.33      1.33      2.33      3.33      4.33
2      0.56      1.56      2.56      3.56      4.56
3     45.90     46.90     47.90     48.90     49.90
4     46.90     47.90     48.90     49.90     50.90
5     47.90     48.90     49.90     50.90     51.90
6     48.90     49.90     50.90     51.90     52.90
7     49.90     50.90     51.90     52.90     53.90
8     50.90     51.90     52.90     53.90     54.90
9     51.90     52.90     53.90     54.90     55.90

If it is a dataframe, then this should work:

df[df2.ID.tolist()]
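As a minimal sketch of the DataFrame case, the snippet below builds a small stand-in for the two frames (the values are copied from the first rows of the question's sample; the reduced column set and the choice of `cg1` and `cg3` in `df2` are assumptions made for illustration) and selects the columns named in `df2["ID"]`:

```python
import pandas as pd

# Small stand-in for the question's df1 (only a few rows and columns).
df = pd.DataFrame(
    {
        "cg1": [33435.0, 0.33, 0.56],
        "cg2": [33436.0, 1.33, 1.56],
        "cg3": [33437.0, 2.33, 2.56],
        "cg4": [33438.0, 3.33, 3.56],
    },
    index=["AAD_1", "AAD_2", "AAD_3"],
)

# Stand-in for df2: one column holding the column names of interest.
df2 = pd.DataFrame({"ID": ["cg1", "cg3"]})

# Passing a list of labels selects those columns, in the order given.
subset = df[df2["ID"].tolist()]
print(subset)
```

Bracket access (`df2["ID"]`) is used here instead of attribute access (`df2.ID`); the two are interchangeable for this column name.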

edited Aug 19, 2016 at 21:18

answered Aug 19, 2016 at 21:17

Alexander


3 Comments

Merlin Over a year ago

Does the OP want an R solution?

piRSquared Over a year ago

regardless of language confusion, this is a useful answer. @Alexander what about df2.squeeze().tolist()?

Alexander Over a year ago

@piRSquared I believe df2.ID is more explicit. It makes the code clearer for others reading it.

akrun · Accepted Answer · 2016-08-22 09:03:42Z

1

In R, we can just do

df[as.character(df2$ID)]

assuming that the 'ID' column is a factor. If it is of character class, it is even easier:

df[df2$ID]

But if there are elements in 'ID' that are not in the column names of 'df', it may be better to use intersect

df[intersect(colnames(df), df2$ID)]
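For readers following the pandas answer instead, the same guard against missing names can be sketched in Python. This is only an illustrative counterpart, not part of the R answer: the frames are small made-up stand-ins, and `cg99` is a hypothetical name that does not exist in `df`. In recent pandas versions, selecting with a list that contains a missing label raises a `KeyError`, so a boolean mask over the column names is used here.

```python
import pandas as pd

# Made-up stand-in for the data frame being subset.
df = pd.DataFrame(
    {
        "cg1": [33435.0, 0.33],
        "cg2": [33436.0, 1.33],
        "cg3": [33437.0, 2.33],
    }
)

# "cg99" is a hypothetical ID with no matching column in df.
df2 = pd.DataFrame({"ID": ["cg3", "cg99", "cg1"]})

# Keep only the columns of df whose names appear in df2["ID"].
# Like intersect(colnames(df), df2$ID), this follows df's column order.
mask = df.columns.isin(df2["ID"])
present = df.loc[:, mask]
print(present.columns.tolist())
```

The names that were requested but not found can be listed with `set(df2["ID"]) - set(df.columns)` if they need to be reported.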

If 'df' is a data.table, the usual way to subset columns by name is to add with = FALSE. This is described in ?data.table:

with

By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables.

When with=FALSE j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is often useful in data.table to select columns dynamically. Note that x[, cols, with=FALSE] is equivalent to x[, .SD, .SDcols=cols].

Therefore, the above commands would be

 df[, as.character(df2$ID), with = FALSE]

or

 df[, df2$ID, with = FALSE] #if 'ID' is already character class.

Or

 df[, intersect(colnames(df), df2$ID), with = FALSE]

edited Aug 22, 2016 at 9:03

answered Aug 20, 2016 at 4:35

akrun


8 Comments

ARJ Over a year ago

The above R one-liner throws this error: When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.

akrun Over a year ago

@user1017373 I assumed your dataset was a data.frame; from your input I couldn't tell whether it was a data.table or a data.frame. If it is a data.table, use df[, as.character(df2$ID), with = FALSE]

ARJ Over a year ago

df1 was read with fread from data.table, so it is a data.table. Yes, you are right.

akrun Over a year ago

@user1017373 Okay, then the with = FALSE should succeed.

ARJ Over a year ago

Thanks, it worked. Now I have a perfect match in UP :) :)
