2

I have two data frames The first one is df1 has 485513 columns and 100 rows,

head(df1)

sample  cg1 cg2 cg3 cg4 cg5 cg6 cg7 cg8 cg9 cg10    cg11
AAD_1   33435   33436   33437   33438   33439   33440   33441   33442   33443   33444   33445
AAD_2   0.33    1.33    2.33    3.33    4.33    5.33    6.33    7.33    8.33    9.33    10.33
AAD_3   0.56    1.56    2.56    3.56    4.56    5.56    6.56    7.56    8.56    9.56    10.56
AAD_4   45.9    46.9    47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9
AAD_5   46.9    47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9
AAD_6   47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9
AAD_7   48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9
AAD_8   49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9
AAD_9   50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9    60.9
AAD_10  51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9    60.9    61.9

and the second one has df2 84 rows and single column. I am aiming to get a subset of df1 using the values in the column from the df2 data frame.

head(df2)
    ID
    cg1
    cg2
    cg3
    cg4
    cg5

The values of df2 are the columns names of my interest from df1 and so I have tried the following one-liner in R.

> UP=(df1 %>% as.data.frame)[,df2$ID]

The Up data frame returns me with unmatched columns from my query df2

And it resulted in a data frame UP with 84 columns and 100 rows but none of the columns the above command line returned is matching with the input query data frame df2.

It would be great if someone suggests me an alternative solution

2
  • 2
    Why does this question have both tags for Pandas and R? Which solution are you looking for? Commented Aug 19, 2016 at 21:19
  • Actually R is better Commented Aug 22, 2016 at 8:21

2 Answers 2

2

Assuming df2 is a Series:

>>> df[df2.tolist()]

        cg1       cg2       cg3       cg4       cg5
0  33435.00  33436.00  33437.00  33438.00  33439.00
1      0.33      1.33      2.33      3.33      4.33
2      0.56      1.56      2.56      3.56      4.56
3     45.90     46.90     47.90     48.90     49.90
4     46.90     47.90     48.90     49.90     50.90
5     47.90     48.90     49.90     50.90     51.90
6     48.90     49.90     50.90     51.90     52.90
7     49.90     50.90     51.90     52.90     53.90
8     50.90     51.90     52.90     53.90     54.90
9     51.90     52.90     53.90     54.90     55.90

If it is a dataframe, then this should work:

df[df2.ID.tolist()]
Sign up to request clarification or add additional context in comments.

3 Comments

OP want R solution?
regardless of language confusion, this is a useful answer. @Alexander what about df2.squeeze().tolist()?
@piRSquared I believe df.ID is more explicit. It makes the code more clear for others reading it.
1

In R, we can just do

df[as.character(df2$ID)]

assuming that 'ID' column is factor. In case it is character class, it is more easier

df[df2$ID]

But if there are elements in 'ID' that are not in the column names of 'df', it may be better to use intersect

df[intersect(colnames(df), df2$ID)]

If the 'df' is a data.table, the usual way to subset columns will be to include the with =FALSE. It is mentioned in ?data.table

with

By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables.

When with=FALSE j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is often useful in data.table to select columns dynamically. Note that x[, cols, with=FALSE] is equivalent to x[, .SD, .SDcols=cols].

Therefore, the above commands would be

 df[, as.character(df2$ID), with = FALSE]

or

 df[, df2$ID, with = FALSE] #if 'ID' is already character class.

Or

 df[, intersect(colnames(df), df2$ID), with = FALSE]

8 Comments

The above R one liner is throwing error as; When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
@user1017373 I assumed your dataset as data.frame and from your input I couldn't gather it was data.table or data.frame. If it is data.table, df[, as.character(df2$ID), with = FALSE]
The df1 is read as fread from data.table.its data.table, yes you are right.
@user1017373 Okay, then the with = FALSE should succeed.
Thanks it worked..Now I have perfect match in UP :) :)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.