I am working in R trying to generate several distinct vectors using a for loop.

First I created a small reproducible example data frame called df.

cluster.assignment <- c("1 Unknown", "1 Unknown", "2 Neuron", "3 PBMC", "4 Basket")
Value1 <- c("a","b","c","d","e")
Value2 <- c("191","234","178","929","123")
df <- data.frame(cluster.assignment,Value1,Value2)

df

  cluster.assignment Value1 Value2
1          1 Unknown      a    191
2          1 Unknown      b    234
3           2 Neuron      c    178
4             3 PBMC      d    929
5           4 Basket      e    123

Next I create a variable named clusters that includes keys to the datasets that I am interested in.

clusters <- c("1 ","4 ")

Here is my attempt to extract rownames of the data of interest in df using a for loop.

for (COI in clusters) { 
  name2 <- c(gsub(" ","", paste("Cluster", COI, sep = "_")))
  assign(Cluster_1, name2, envir = parent.frame())
  name2 <- grep(COI, df$cluster.assignment)
}

Desired output is two vectors called Cluster_1 and Cluster_4.

Cluster_1 would contain the values 1 and 2

Cluster_4 would contain the value 5

I can't seem to figure out how to assign the name of the COI variable to be the name of the output vector.

  • COI takes the value of each element of clusters, that is, first it is "1 " and then it is "4 ". A number with a space is an exceptionally bad variable name--is this really what you want, to assign the name of the COI variable to be the name of the output? Commented Sep 4, 2018 at 19:00
  • In this case yes because I am mining an existing dataset generated by someone else. Commented Sep 4, 2018 at 19:03

2 Answers

I would suggest against using assign. Instead, I'll create a named list. See this answer for a long discussion of why lists are better than sequentially named variables. If, at any point, you decide you want to convert the list to objects in the global environment, you can use list2env, but doing so will probably just make more work.

## subset the data to the parts we care about, use `split` to separate it
## into a list
subdf = df[grepl(paste(clusters, collapse = "|"), df$cluster.assignment), ]
result = split(subdf, subdf$cluster.assignment, drop = TRUE)
result
# $`1 Unknown`
#   cluster.assignment Value1 Value2
# 1          1 Unknown      a    191
# 2          1 Unknown      b    234
# 
# $`4 Basket`
#   cluster.assignment Value1 Value2
# 5           4 Basket      e    123

## name the list as desired
names(result) = paste("Cluster", trimws(clusters), sep = "_")
result
# $Cluster_1
#   cluster.assignment Value1 Value2
# 1          1 Unknown      a    191
# 2          1 Unknown      b    234
# 
# $Cluster_4
#   cluster.assignment Value1 Value2
# 5           4 Basket      e    123

## if only the row names are needed, use lapply
result = lapply(result, row.names)
result
# $Cluster_1
# [1] "1" "2"
# 
# $Cluster_4
# [1] "5"
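
If a loop is preferred, for instance to iterate over hundreds of keys, the same named-list idea can be sketched roughly as below. This is only an illustration under a few assumptions: `cluster_rows` and `env` are hypothetical names, and `startsWith()` is swapped in for `grep()` so that a key only matches at the beginning of the label.

```r
# Rebuild the example data from the question
df <- data.frame(
  cluster.assignment = c("1 Unknown", "1 Unknown", "2 Neuron", "3 PBMC", "4 Basket"),
  Value1 = c("a", "b", "c", "d", "e"),
  Value2 = c("191", "234", "178", "929", "123")
)
clusters <- c("1 ", "4 ")

# Collect the matching row indices in a named list, one element per key
cluster_rows <- list()   # hypothetical name
for (COI in clusters) {
  name <- paste("Cluster", trimws(COI), sep = "_")
  # as.character() guards against the column being a factor (R < 4.0 default)
  cluster_rows[[name]] <- which(startsWith(as.character(df$cluster.assignment), COI))
}
cluster_rows
# $Cluster_1
# [1] 1 2
#
# $Cluster_4
# [1] 5

# Optional: copy the list elements into an environment as separate objects.
# A fresh environment is used here; envir = globalenv() would put
# Cluster_1 and Cluster_4 in the workspace instead.
env <- list2env(cluster_rows, envir = new.env())
ls(env)
```

Keeping the results in the list means later steps can use `lapply` or index by name, with `list2env` held back as a last step only if separate objects are really needed.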

A few other notes: I assume you are including the spaces in clusters to prevent, e.g., "1" from matching "12 foo". You might consider using the regex word boundary "\\b1\\b" instead, as "1 " will still match, say, "11 foo" or "21 bar". Better yet, you could use strsplit or similar to create a new column with just the numeric key you want to match.
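
To make the matching issue concrete, here is a small sketch on made-up labels. The `labels` vector and the `key` variable are hypothetical and not part of the question's data.

```r
# Hypothetical labels, chosen to show the over-matching problem
labels <- c("1 Unknown", "11 foo", "21 bar", "4 Basket")

# The plain pattern "1 " matches anywhere in the string
loose <- grepl("1 ", labels)
loose
# [1]  TRUE  TRUE  TRUE FALSE

# Word boundaries only match a standalone 1
bounded <- grepl("\\b1\\b", labels)
bounded
# [1]  TRUE FALSE FALSE FALSE

# Alternatively, extract the leading key once and compare it exactly
key <- sapply(strsplit(labels, " "), `[`, 1)
key
# [1] "1"  "11" "21" "4"
wanted <- key %in% c("1", "4")
wanted
# [1]  TRUE FALSE FALSE  TRUE
```

With a separate key column, the lookup becomes an exact comparison, so the trailing spaces in clusters would no longer be needed.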

1 Comment

Oh my, I see now why the spaces are so bad. Thanks for your suggestions and very informative answer I will give them a try!

I don't see the necessity to create a for loop for this unless you have your own reasons, but the following code gives you what you want:

library(data.table)
Cluster_1<-df[df$cluster.assignment %like% "1 ", c("Value1", "Value2")]
Cluster_4<-df[df$cluster.assignment %like% "4 ", c("Value1", "Value2")]
View(Cluster_1);View(Cluster_4)

You can remove or alter c("Value1", "Value2") to get the columns that you want in the final output.

1 Comment

I should have specified that this is a small portable example. Unfortunately in real life I need to repeat this over hundreds of different COI values. So a loop to iterate the process and make it portable across datasets is required. The heart of the question really is how do we do this in a for loop or some other high throughput way.
