Extract substring between two specified substrings

Question

I have a list of strings with the following pattern:

my.list <- list("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext")

and so on. I'd like to extract the file numbers and the subfile-numbers into a numeric data frame containing the file/subfile numbers. So far, I've been using the following approach:

extract.file <- function(file.name){
  file.name <- sub("file", "", file.name)
  file.name <- sub("\\\\*subfile.*", "", file.name)
}

extract.subfile <- function(subfile.name){
  subfile.name <- sub("file.*subfile", "", subfile.name)
  subfile.name <- sub("-D.ext", "", subfile.name)
}

name.file <- lapply(my.list, extract.file)
name.file <- as.numeric(unlist(name.file))
name.subfile <- lapply(my.list, extract.subfile)
name.subfile <- as.numeric(unlist(name.subfile))

my.df <- data.frame(file=name.file, subfile=name.subfile)

I've also tried first extracting the string locations with substring.location from the stringr library (which yields another list with start and end values) and then looping over the two lists, but this quickly gets too complicated. Is there a better way to achieve the goal?

Thell · Accepted Answer · 2012-08-14 14:23:23Z

5

Some alternatives:
[Edit: strsplit can take a vector and return a list, which cuts the time roughly in half compared to nesting an apply within the rbind call.]

my.df <- do.call( rbind, strsplit( unlist(my.list), split="(\\\\|-D\\.ext)" ) )
my.df <- data.frame( my.df )
names( my.df ) <- c("file", "subfile")

or

my.df <- do.call( rbind, strsplit( unlist(my.list), split="[^[:alnum:]]" ) )[, 1:2]
my.df <- data.frame( my.df )
names( my.df ) <- c("file", "subfile")

One drawback of doing things this way is that every value still carries the redundant "file" or "subfile" prefix, which is of little use if all of the input follows the pattern of the original my.list sample, and the columns hold text rather than numbers.

A better solution might be:

# strsplit() returns "" as the first element when the split pattern matches at
# the very start of the string (see ?strsplit), so we drop the first column.
my.list <- unlist( my.list )
my.df <- do.call( rbind, strsplit( my.list, split="[^[:digit:]]+" ) )[,-1]
mode( my.df ) <- "numeric"  # convert the character matrix to numeric
my.df <- data.frame( my.list, my.df )
names( my.df ) <- c( "orig", "file", "subfile" )

Keeping only the digits avoids storing the repeated "file" and "subfile" prefixes, and once the columns are numeric they sort and compare as numbers rather than as text.
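To illustrate the ordering point, here is a minimal sketch with made-up values (the vector name file.chr is hypothetical and not part of the answer's code). It compares how the same file numbers sort as text and as numbers:

```r
# Hypothetical sample values, mirroring the file numbers in my.list
file.chr <- c("1", "12", "2")

# Compared as text, "12" is ordered before "2"
text.order <- sort(file.chr)

# Compared as numbers, 2 comes before 12
num.order <- sort(as.numeric(file.chr))

print(text.order)
print(num.order)
```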

See ?strsplit, ?regex, and ?grep for details on the pattern matching.

The data.frame setup is pretty straightforward: strsplit takes a vector and returns a list, and do.call takes a list and passes its elements to rbind as separate arguments.
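A small sketch of those two steps on their own may help. The variable names x, parts, and m are only for illustration, and the input is a shortened copy of the question's sample:

```r
# Sample input in the same shape as the question's my.list
x <- c("file1\\subfile1-D.ext", "file12\\subfile9-D.ext")

# strsplit() is vectorised: it returns one list element per input string
parts <- strsplit(x, split = "[^[:digit:]]+")
str(parts)

# do.call() passes the list elements as separate arguments,
# so this is equivalent to rbind(parts[[1]], parts[[2]])
m <- do.call(rbind, parts)

# One row per input string; the first column holds the empty strings
print(dim(m))
print(m)
```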

answered Aug 13, 2012 at 17:28

5 Comments

GSee Over a year ago

I find that lapply is safer than sapply when used with do.call(rbind, ...). After all, the rbind is doing the simplifying for you.

Thell Over a year ago

@Gsee, since strsplit is going to return a list and lapply will return a list then unlist with recursive=FALSE would be needed to pass the correct level of list to rbind, yes?

GSee Over a year ago

yes. You'd need unlist wrapped around the strsplit. I've got to start testing code before commenting. ;-)

Thell Over a year ago

Thanks for bringing that up! Made me go back and look again at strsplit output, which caused an edit in the answer. :)

AnjaM Over a year ago

@Thell That's perfect, exactly what I need to get as output, and in a really nice and clear way. Thanks a lot!

Andrie · Accepted Answer · 2012-08-13 15:14:57Z

2

Here is a regex with backreferences that seems to do what you ask for:

sapply(my.list, function(x)gsub(".*\\\\(.*)-D\\.ext", "\\1", x))
[1] "subfile1"   "subfile9"   "subfile113"

The "\\1" is a backreference: in the replacement, it stands for the text matched by the first pair of parentheses in the pattern.
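The same idea can be stretched to pull out both numbers, as the question asks. The following is only a sketch: the pattern is my own and assumes every name looks exactly like file<digits>\subfile<digits>-D.ext; names that do not match would be returned unchanged by sub and become NA after as.numeric.

```r
my.list <- list("file1\\subfile1-D.ext", "file12\\subfile9-D.ext", "file2\\subfile113-D.ext")
x <- unlist(my.list)

# One capture group per number; "\\\\" matches a single literal backslash
pattern <- "^file([0-9]+)\\\\subfile([0-9]+)-D\\.ext$"

my.df <- data.frame(
  file    = as.numeric(sub(pattern, "\\1", x)),  # first group: file number
  subfile = as.numeric(sub(pattern, "\\2", x))   # second group: subfile number
)
print(my.df)
```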

answered Aug 13, 2012 at 14:59
