1

I have a list of substrings with the following pattern:

my.list <- list("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext")

and so on. I'd like to extract the file numbers and the subfile-numbers into a numeric data frame containing the file/subfile numbers. So far, I've been using the following approach:

extract.file <- function(file.name){
  file.name <- sub("file", "", file.name)
  file.name <- sub("\\\\*subfile.*", "", file.name)
}

extract.subfile <- function(subfile.name){
  subfile.name <- sub("file.*subfile", "", subfile.name)
  subfile.name <- sub("-D.ext", "", subfile.name)
}

name.file <- lapply(my.list, extract.file)
name.file <- as.numeric(unlist(name.file))
name.subfile <- lapply(my.list, extract.subfile)
name.subfile <- as.numeric(unlist(name.subfile))

my.df <- data.frame(file=name.file, subfile=name.subfile)

I've also played around with first extracting the string locations with substring.location from stringr library (which yields another list with start and end values), and then looping over the two lists, but this gets too complicated again. Is there a better way to achieve the goal?

2 Answers

5

Some alternatives:
[Edit: strsplit accepts a character vector and returns a list, which cuts the run time roughly in half compared to nesting an apply call within the rbind call.]

my.df <- do.call( rbind, strsplit( unlist(my.list), split="(\\\\|-D\\.ext)" ) )
my.df <- data.frame( my.df )
names( my.df ) <- c("file", "subfile")

or

my.df <- do.call( rbind, strsplit( unlist(my.list), split="[^[:alnum:]]" ) )[, 1:2]
my.df <- data.frame( my.df )
names( my.df ) <- c("file", "subfile")

One drawback of doing things this way is that every value still carries its "file" or "subfile" prefix, which is pretty worthless and redundant data if all of the input follows the original my.list sample.

Perhaps a better solution might be:

# strsplit() returns an empty string as the first element when the split
# pattern matches at the very start of the input (here, "file"), so we
# drop the first returned column.
my.list <- unlist( my.list )
my.df <- do.call( rbind, strsplit( my.list, split="[^[:digit:]]+" ) )[,-1]
# The pieces are still character strings; convert the matrix to numeric.
mode( my.df ) <- "numeric"
my.df <- data.frame( my.list, my.df )
names( my.df ) <- c( "orig", "file", "subfile" )

We've saved quite a bit of memory/storage by dropping all of that duplicated text, and because only the digits remain, the values, once converted to numeric, can be sorted and compared as numbers without fussing with text/character ordering or representation.


Check ?strsplit, ?regex, and ?grep for details on the pattern matching.

The data.frame setup is pretty straightforward: strsplit takes a character vector and returns a list, and do.call passes the elements of that list to rbind as separate arguments, so they are bound together row by row.
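To make the two steps concrete, here is a minimal sketch on the question's sample input. The intermediate names parts, mat, and nums are only for illustration and do not appear in the answer above.

```r
my.list <- list("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext")

# strsplit() takes a character vector and returns a list holding one
# character vector of pieces per input string.
parts <- strsplit(unlist(my.list), split = "[^[:digit:]]+")
str(parts)

# do.call() hands the list elements to rbind() as separate arguments,
# stacking them into a character matrix with one row per input string.
mat <- do.call(rbind, parts)

# Drop the leading empty-string column, then convert the text to numbers.
nums <- mat[, -1, drop = FALSE]
mode(nums) <- "numeric"

my.df <- data.frame(file = nums[, 1], subfile = nums[, 2])
```

This only works as written when every input string yields the same number of pieces; otherwise rbind would recycle values to fill the shorter rows.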


5 Comments

I find that lapply is safer than sapply when used with do.call(rbind, ...). After all, the rbind is doing the simplifying for you.
@Gsee, since strsplit is going to return a list and lapply will return a list then unlist with recursive=FALSE would be needed to pass the correct level of list to rbind, yes?
yes. You'd need unlist wrapped around the strsplit. I've got to start testing code before commenting. ;-)
Thanks for bringing that up! Made me go back and look again at strsplit output, which caused an edit in the answer. :)
@Thell That's perfect, exactly what I need to get as output, and in a really nice and clear way. Thanks a lot!
2

Here is a regex with backreferences that seems to do what you ask for:

sapply(my.list, function(x)gsub(".*\\\\(.*)-D\\.ext", "\\1", x))
[1] "subfile1"   "subfile9"   "subfile113"

The "\\1" in the replacement is a backreference: it stands for the text matched by the first parenthesized group in the pattern.
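The same idea can be extended to pull out both numbers, as a rough sketch. It assumes every name follows the file<digits>\subfile<digits>-D.ext layout from the question; the variable names x and pattern are illustrative.

```r
my.list <- list("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext")
x <- unlist(my.list)

# Two parenthesized groups capture the digits after "file" and "subfile";
# "\\1" and "\\2" in the replacement refer to those groups in order.
pattern <- "^file([0-9]+)\\\\subfile([0-9]+)-D\\.ext$"

my.df <- data.frame(
  file    = as.numeric(sub(pattern, "\\1", x)),
  subfile = as.numeric(sub(pattern, "\\2", x))
)
```

Note that sub() returns a string unchanged when the pattern does not match, so an input that breaks the assumed layout would turn into NA (with a warning) in as.numeric.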
