I have multiple strings, and I want to extract the part they all have in common. In practice my strings are directories, and I need to choose where to write a file: the location that is shared by all of the strings. For example, given a vector with three strings:

data.dir <- c("C:\\data\\files\\subset1\\", "C:\\data\\files\\subset3\\", "C:\\data\\files\\subset3\\")

...the part that matches in all strings is "C:\data\files\". How can I extract this?

    Are you looking for an arbitrary match in the middle of the strings or are you just looking for a prefix match? If the latter, are you looking for a delimited match? (The application as presented does permit that last assumption, although the title doesn't suggest that limitation.) Commented Nov 27, 2016 at 23:08

2 Answers

3

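Split each string on the backslash with strsplit, then use Reduce to apply intersect across the resulting vectors so that only the components present in every string remain. You can then piece the path back together with paste.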

paste(Reduce(intersect, strsplit(data.dir, "\\\\")), collapse="\\")
#[1] "C:\\data\\files"

As @g-grothendieck notes, this will fail when a component reappears after the paths have diverged, because intersect ignores position. For example:

data.dir <- c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\") 
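
To make the failure concrete, running the one-liner above on this vector keeps the trailing "c" even though the paths have already diverged at the third component:

```r
data.dir <- c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\")
# intersect() matches components regardless of position, so "c" survives
res <- paste(Reduce(intersect, strsplit(data.dir, "\\\\")), collapse = "\\")
res
#[1] "C:\\a\\c"
```

The result is a path that exists in neither input.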

An ugly hack might be something like:

tail(
  Reduce(
    intersect,
    lapply(
      strsplit(data.dir, "\\\\"),
      # build the cumulative paths "C:", "C:\\a", "C:\\a\\b", ... for each string
      function(x) sapply(seq_along(x), function(y) paste(x[1:y], collapse = "\\"))
    )
  ),
  1
)

...which will deal with either case.
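
A possibly more readable variant of the same idea is to compare components position by position and stop at the first mismatch. This is only a sketch: common_dir is a hypothetical helper name, it assumes a non-empty character vector, and it splits on a literal backslash by default.

```r
# Hypothetical helper: longest run of leading path components shared by all paths.
# Assumes `paths` is a non-empty character vector.
common_dir <- function(paths, sep = "\\") {
  parts <- strsplit(paths, sep, fixed = TRUE)
  n <- min(lengths(parts))
  k <- 0
  # Advance while every path has the same component at position k + 1
  while (k < n &&
         length(unique(vapply(parts, `[`, character(1), k + 1))) == 1) {
    k <- k + 1
  }
  # paste() on zero components yields "", i.e. no common directory
  paste(parts[[1]][seq_len(k)], collapse = sep)
}

common_dir(c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\"))
#[1] "C:\\a"
```

Because the comparison is positional, a component that reappears later in the path cannot be picked up by mistake.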


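Alternatively, use dirname if the paths only ever differ in their last directory level (the output below is from Windows, where dirname treats the backslash as a path separator and returns forward slashes):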

unique(dirname(data.dir))
#[1] "C:/data/files"

2

g contains the character positions of the successive backslashes in data.dir[1]. From this, create a logical vector ok whose ith element is TRUE if the first g[i] characters of every element of data.dir are the same, i.e. all elements of substr(data.dir, 1, g[i]) are equal. If ok[1] is TRUE, there is a non-empty common prefix consisting of the first g[k] characters of data.dir[1], where k (which equals rle(ok)$lengths[1]) is the number of leading TRUE values in ok; otherwise there is no common prefix, so return "".

g <- gregexpr("\\", data.dir[1], fixed = TRUE)[[1]]
ok <- sapply(g, function(i) all(substr(data.dir[1], 1, i) == substr(data.dir, 1, i)))
if (ok[1]) substr(data.dir[1], 1, g[rle(ok)$lengths[1]]) else ""

For data.dir defined in the question the last line gives:

[1] "C:\\data\\files\\"
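
For reuse, the three lines above could be wrapped in a function. The sketch below does only that; the name common_prefix and the sep argument are illustrative assumptions, not part of the original answer.

```r
# Hypothetical wrapper around the three lines above.
common_prefix <- function(paths, sep = "\\") {
  # positions of each separator in the first path
  g <- gregexpr(sep, paths[1], fixed = TRUE)[[1]]
  # TRUE where all paths agree on the first g[i] characters
  ok <- sapply(g, function(i) all(substr(paths[1], 1, i) == substr(paths, 1, i)))
  if (ok[1]) substr(paths[1], 1, g[rle(ok)$lengths[1]]) else ""
}

common_prefix(c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\"))
#[1] "C:\\a\\"
```

Unlike the strsplit-based approaches, this keeps the trailing backslash, since the prefix is cut at a separator position in the original string.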
