Replace all strings with numbers based on conditions

Question

I have a column of data that describes possible diseases, and I am trying to change these qualitative values into quantitative ones. For example, I want to set conditions such as: if a row contains the words "blood pressure", delete all characters and replace them with 3; if a row contains "heart", replace with 2; if a row contains "diabetes" or "kidney disease", replace with 1; for any other condition, replace with 0.5.

For example my data looks like:

Gene     Condition
Gene1    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker
Gene2    Name=blood pressure, Name=diabetes
Gene3    Name=heart disease
Gene4    Name=Childhood ear infection
Gene5    NA
Gene6    Name=kidney disease

The output I am trying to reach based on my mentioned conditions is:

Gene Condition
Gene1    0.5
Gene2    3
Gene3    2
Gene4    0.5
Gene5    NA
Gene6    1

I am new to R, so I am not sure whether my approach is the best. I am trying to run my conditions to replace the specific strings (but not all characters), which produces multiple numbers in a row (mixed with strings) if more than one condition is met, and then apply a get-max function to each row to get the largest number available. However, I am stuck on setting up the conditions to perform the string-to-number conversion. I've been trying:

data$condition[data$condition == "blood pressure"] <- "3"
data$condition[data$condition == "heart disease"] <- "2"
data$condition[data$condition == "diabetes" | "kidney disease"] <- "1"
data$condition[data$condition == "Name" && !"diabetes" | "kidney disease" | "blood pressure" | "heart disease"] <- "0.5"

However, this gives the error "object of type 'closure' is not subsettable", and, for this approach at least, I can't find a solution for this error online. Any help would be appreciated.

Example data (first time trying to give data, please let me know if something is amiss):

structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6"), Condition = c("    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker", 
"    Name=blood pressure, Name=diabetes", "Name=heart disease", 
"Name=Childhood ear infection", NA, "Name=kidney disease")), row.names = c(NA, 
-6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000001bea99a1ef0>)

There are certain open problems in your data. For example, row 2 contains both blood pressure and diabetes, which have different values. What should be chosen in such scenarios?
– tmfmnk, Commented Feb 17, 2020 at 19:01
The highest value should be chosen. Apologies, I should have made clear that I deem the highest value the most important; that is why I'm trying to get all the numbers into one cell and then apply getmax to the cell/each row.
– DN1, Commented Feb 18, 2020 at 9:17

zx8754 · Accepted Answer · 2020-02-17 19:16:32Z

Using grepl:

data$Condition[ grepl("blood pressure", data$Condition) ] <- "3"
data$Condition[ grepl("heart disease",  data$Condition) ] <- "2"
# etc...
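The "etc." above can be filled in roughly as follows. This is only a sketch under a few assumptions: it writes the scores into a numeric vector instead of overwriting the strings, it assigns from the lowest score to the highest so that the highest matching score wins, and it rebuilds the example data as a plain data frame. As an aside, the "closure" error in the question usually means that `data` resolved to R's built-in `data()` function instead of a data frame, for instance because the object was never assigned under that name.

```r
# Reproducible copy of the example data as a plain data frame
data <- data.frame(
  Gene = paste0("Gene", 1:6),
  Condition = c(
    "    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker",
    "    Name=blood pressure, Name=diabetes",
    "Name=heart disease",
    "Name=Childhood ear infection",
    NA,
    "Name=kidney disease"
  ),
  stringsAsFactors = FALSE
)

x <- data$Condition

# Every non-missing row starts at the fallback 0.5; NA rows stay NA
score <- ifelse(is.na(x), NA_real_, 0.5)

# Assign from lowest to highest so that higher scores overwrite lower ones.
# grepl() returns FALSE for NA, so missing rows are never touched.
score[grepl("diabetes|kidney disease", x)] <- 1
score[grepl("heart", x)] <- 2
score[grepl("blood pressure", x)] <- 3

data$Condition <- score
data
```

Because the scores are kept in a separate vector until the end, later patterns are always matched against the original text and not against an already replaced value.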

Or, as a slightly better approach when there are multiple conditions, convert them into new rows; then we can do a direct comparison using == instead of the regex match grepl:

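# split each Condition on "," into one row per condition, then drop the
# "Name=" prefix and surrounding whitespace so that == comparisons work
res <- data[, list(Condition = unlist(strsplit(Condition, ","))), by = Gene
            ][, Condition := trimws(gsub("Name=", "", Condition)) ]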

res
# Gene                                         Condition
# 1: Gene1               Asymmetrical dimethylarginine level
# 2: Gene1                Bipolar disorder and schizophrenia
# 3: Gene1  3-hydroxypropylmercapturic acid levels in smoker
# 4: Gene2                                    blood pressure
# 5: Gene2                                          diabetes
# 6: Gene3                                     heart disease
# 7: Gene4                           Childhood ear infection
# 8: Gene5                                              <NA>
# 9: Gene6                                    kidney disease
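From this long format, one possible way to finish is an exact-match lookup followed by a per-gene maximum. The sketch below assumes the data.table package is installed; the lookup vector `scores` and the column name `Value` are hypothetical names introduced here, and the whitespace left over from splitting on "," is trimmed so that exact matches succeed.

```r
library(data.table)

# Example data rebuilt as a data.table
data <- data.table(
  Gene = paste0("Gene", 1:6),
  Condition = c(
    "    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker",
    "    Name=blood pressure, Name=diabetes",
    "Name=heart disease",
    "Name=Childhood ear infection",
    NA,
    "Name=kidney disease"
  )
)

# One row per condition, with the "Name=" prefix and stray spaces removed
res <- data[, list(Condition = unlist(strsplit(Condition, ","))), by = Gene
            ][, Condition := trimws(gsub("Name=", "", Condition))]

# Hypothetical lookup vector: names are the exact condition strings
scores <- c("blood pressure" = 3, "heart disease" = 2,
            "diabetes" = 1, "kidney disease" = 1)

# Exact-match lookup; conditions without an entry come back as NA
res[, Value := unname(scores[Condition])]

# Any other non-missing condition gets the fallback score
res[is.na(Value) & !is.na(Condition), Value := 0.5]

# Keep the highest score per gene; a gene with no condition stays NA
out <- res[, list(Condition = max(Value)), by = Gene]
out
```

The exact-match lookup means "heart disease" has to be spelled out in `scores`; a bare "heart" would only match with a partial match such as grepl.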

G. Grothendieck · Accepted Answer · 2020-02-17 20:18:09Z

The matching operation can be represented as a complex join in SQL. First create numDF, a two-column data frame with the names to match in the first column and their numbers in the second column. Then perform the join.

library(sqldf)

nums <- c("blood pressure" = 3, heart = 2, diabetes = 1, "kidney disease" = 1)
numDF <- data.frame(Name = names(nums), Value = as.vector(nums))

sqldf("select 
    a.Gene, 
    max(case when a.Condition is not Null then coalesce(b.Value, 0.5) end) Condition
  from DF a 
  left join numDF b on a.Condition like '%' || b.Name || '%'
  group by Gene", method = "raw")

giving:

   Gene Condition
1 Gene1       0.5
2 Gene2       3.0
3 Gene3       2.0
4 Gene4       0.5
5 Gene5        NA
6 Gene6       1.0
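For readers who prefer to stay in base R, the same table-driven idea can be sketched without sqldf. This is an illustration and not part of the original answer: `score_one` and `result` are hypothetical names, and the substring match here is case-sensitive, whereas SQLite's `like` ignores case for ASCII letters by default.

```r
DF <- data.frame(
  Gene = paste0("Gene", 1:6),
  Condition = c(
    "    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker",
    "    Name=blood pressure, Name=diabetes",
    "Name=heart disease",
    "Name=Childhood ear infection",
    NA,
    "Name=kidney disease"
  ),
  stringsAsFactors = FALSE
)

nums <- c("blood pressure" = 3, heart = 2, diabetes = 1, "kidney disease" = 1)

# Score a single Condition string (hypothetical helper)
score_one <- function(cond) {
  if (is.na(cond)) return(NA_real_)
  # fixed = TRUE treats each name as a literal substring, similar to '%name%'
  hit <- vapply(names(nums), grepl, logical(1), x = cond, fixed = TRUE)
  # Highest matching score, or the 0.5 fallback when nothing matches
  if (any(hit)) max(nums[hit]) else 0.5
}

result <- data.frame(
  Gene = DF$Gene,
  Condition = vapply(DF$Condition, score_one, numeric(1), USE.NAMES = FALSE)
)
result
```

Each gene occupies a single row in DF, so no grouping step is needed here; the maximum is taken across the matching names within that row.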

Note

The dput output of an object with an internal pointer cannot be pasted back into R as is, so I have modified the dput output to be usable:

DF <-
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6"), Condition = c("    Name=Asymmetrical dimethylarginine level, Name=Bipolar disorder and schizophrenia, Name=3-hydroxypropylmercapturic acid levels in smoker", 
"    Name=blood pressure, Name=diabetes", "Name=heart disease", 
"Name=Childhood ear infection", NA, "Name=kidney disease")), 
row.names = c(NA, -6L), class = "data.frame")

EDIT

Have modified to add the max condition.
