2

I want to parse a string in R and get a list of objects out of it. Brackets, spaces and commas in the string dictate the structure of the final list:

  1. pairs of brackets are separated by a space, and the words in each pair of brackets have to form a new element of the list;

  2. words in brackets are separated by commas and should form separate elements within each list element;

  3. the same structure can also appear nested within a pair of brackets.

Here is an example of the string:

x <- "(K01596,K01610) (K01689) (K01834,K15633,K15634,K15635) (K00927) (K00134,K00150) (K01803) ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)"

The desired output should look like this:

list(c("K01596","K01610"), "K01689", c("K01834","K15633","K15634","K15635"), "K00927", c("K00134","K00150"), "K01803", list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622"))

I managed to work out the parsing for case 1):

match <- gregexpr("\\((?>[^()]|(?R))*\\)", x, perl = TRUE)
x2 <- as.list(substring(x, match[[1]], match[[1]] + attr(match[[1]], "match.length") - 1))

Case 2) is also easy: I can remove the brackets with gsub and split the words with strsplit. The problem is how to parse case 3), where there is a nested level like:

((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)

and I need to get a list element that is itself a list:

list(list(c("K01623","K01624","K11645"), c("K03841","K02446","K11532","K01086","K04041")), "K01622")
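
For reference, here is a minimal sketch of how cases 1) and 2) combine on a string without nesting. The names `x_flat`, `groups` and `flat` are made up for this illustration, and `regmatches` stands in for the `substring` call used earlier purely for brevity.

```r
# Hypothetical flat input: no nested brackets
x_flat <- "(K01596,K01610) (K01689) (K00134,K00150)"

# Case 1: extract each balanced bracketed group
m <- gregexpr("\\((?>[^()]|(?R))*\\)", x_flat, perl = TRUE)
groups <- regmatches(x_flat, m)[[1]]

# Case 2: drop the brackets, then split the words on commas
flat <- lapply(groups, function(g) {
  strsplit(gsub("[()]", "", g), ",", fixed = TRUE)[[1]]
})

str(flat)
```

This only covers the flat case: stripping all brackets with `gsub` throws away exactly the nesting information that case 3) needs.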
2
  • 1.,2.,3. can be restated much more clearly as "1. brackets surround lists 2. commas separate individual items 3. lists can be nested" Commented Jun 26, 2018 at 0:07
  • You also didn't mention the format of individual list items, but it looks like "letter followed by several (5?) digits, e.g. K01596" Commented Jun 26, 2018 at 0:13

2 Answers

1

You can convert to JSON, and then use jsonlite to convert to a list. Once you have this, you can simplify, collapse, or reorganize your list however you like.

library(jsonlite)
library(stringr)

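# Wrap the nested "((...) (...)," prefix in one more pair of brackets,
# so the inner groups end up one level deeper than the trailing word
add_paren <- function(x){
  x <- str_sub(x, end = -2) # remove trailing comma
  paste0("(", x, "), ") # add enclosing paren and put the comma back
}
x <- str_replace_all(x, "\\(\\(.*\\)\\,", add_paren)

# Turn brackets into JSON arrays and separate adjacent arrays with commas
x <- gsub("\\(", "\\[", x)
x <- gsub("\\)", "\\]", x)
x <- gsub("\\] \\[", "\\], \\[", x)

add_quote <- function(x) paste0('"', x, '"')

# Quote every identifier, then wrap everything in an outer array
x <- str_replace_all(x, "K[0-9]+", add_quote)
x <- paste0("[", x, "]")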

x2 <- fromJSON(x)

Resulting in:

dput(x2)

list(c("K01596", "K01610"), "K01689", c("K01834", "K15633", "K15634", 
"K15635"), "K00927", c("K00134", "K00150"), "K01803", list(list(
    c("K01623", "K01624", "K11645"), c("K03841", "K02446", "K11532", 
    "K01086", "K04041")), "K01622"))

str(x2)

List of 7
 $ : chr [1:2] "K01596" "K01610"
 $ : chr "K01689"
 $ : chr [1:4] "K01834" "K15633" "K15634" "K15635"
 $ : chr "K00927"
 $ : chr [1:2] "K00134" "K00150"
 $ : chr "K01803"
 $ :List of 2
  ..$ :List of 2
  .. ..$ : chr [1:3] "K01623" "K01624" "K11645"
  .. ..$ : chr [1:5] "K03841" "K02446" "K11532" "K01086" ...
  ..$ : chr "K01622"

3 Comments

I was going down this route and then abandoned it. Nice solution.
I am not familiar with the JSON format, but it seems to fit my problem quite well! However, if possible, I'd like the 7th element of the list to be structured differently; if you run the last chunk of code I posted in the question you will see what I mean.
Ah, didn't catch that. I've added a line x <- str_replace_all(x, "\\(\\(.*\\)\\,", add_paren) which should ensure that anything matching a pattern ((.*), gets nested one level deeper. This seems to give the desired output.
0

I suggest you apply the regex you already found for case 1) recursively to the input. That is, call your recursive function for each match found.

If no match is found you are in case 2) and can just use strsplit on the input. I have put together an example function below:

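constructList <- function(x) {

  # Find every balanced bracketed group at the current nesting level
  matches <- gregexpr("\\((?>[^()]|(?R))*\\)", x, perl = TRUE)

  # No brackets left: this is case 2), so split the words on commas
  if (matches[[1]][1] == -1) {
    return(strsplit(x, ",")[[1]])
  }

  # Strip the outer brackets of each match and recurse into its contents
  lapply(
    lapply(seq_along(matches[[1]]), function(i)
                                        substr(x,
                                               matches[[1]][i] + 1,
                                               matches[[1]][i] + attr(matches[[1]], "match.length")[i] - 2)),
    constructList)

}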

Output seems OK:

constructList(x)
[[1]]
[1] "K01596" "K01610"

[[2]]
[1] "K01689"

[[3]]
[1] "K01834" "K15633" "K15634" "K15635"

[[4]]
[1] "K00927"

[[5]]
[1] "K00134" "K00150"

[[6]]
[1] "K01803"

[[7]]
[[7]][[1]]
[1] "K01623" "K01624" "K11645"

[[7]][[2]]
[1] "K03841" "K02446" "K11532" "K01086" "K04041"

1 Comment

Nice function! But there are still two problems with the nested part: the last term, "K01622", is missing, and I would like case 3) to be structured as shown by the last chunk of code I posted in the question.
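
One possible way to address both points is to stop relying on the regex for the recursion and instead split only on separators that sit at bracket depth zero: spaces separate groups, commas separate items inside a group. The following is a minimal base R sketch, assuming every bracket in the input is balanced; `split_top`, `parse_group` and `parse_groups` are hypothetical helper names, not part of the answer above.

```r
# Split `s` at every occurrence of the single character `sep`
# that is not enclosed in brackets.
split_top <- function(s, sep) {
  ch <- strsplit(s, "", fixed = TRUE)[[1]]
  # Bracket depth after each character; top-level separators have depth 0
  depth <- cumsum((ch == "(") - (ch == ")"))
  at <- which(ch == sep & depth == 0)
  substring(s, c(1, at + 1), c(at - 1, length(ch)))
}

# Parse one bracketed group such as "(K1,K2)" or "((K1) (K2),K3)".
parse_group <- function(g) {
  inner <- substring(g, 2, nchar(g) - 1)   # strip the outer brackets
  items <- split_top(inner, ",")
  if (!any(grepl("(", items, fixed = TRUE))) {
    return(items)                          # only words: character vector
  }
  # Mixed content: nested groups become sub-lists, bare words stay strings
  lapply(items, function(it) {
    if (grepl("(", it, fixed = TRUE)) parse_groups(it) else it
  })
}

# Parse a space-separated sequence of groups into a list.
parse_groups <- function(s) lapply(split_top(s, " "), parse_group)

x <- "(K01596,K01610) (K01689) (K01834,K15633,K15634,K15635) (K00927) (K00134,K00150) (K01803) ((K01623,K01624,K11645) (K03841,K02446,K11532,K01086,K04041),K01622)"
res <- parse_groups(x)
str(res)
```

Because the comma after the two inner groups is at depth zero within the seventh group, "K01622" is kept as its own item, and the two inner groups are wrapped together in a sub-list as in the question's desired output. Malformed input (unbalanced brackets, stray separators) is not handled here.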
