0

I have

x<-c('abczzzdef','abcxxdef')

I want a function

fn(x)

that returns a length 2 vector

[1] 'zzz' 'xx'

How?

(I have tried searching for an answer but search terms like 'partial matching' give me something quite different)

Update

'length 2 vector' means length(fn(x)) is 2, with fn(x)[1] giving "zzz" and fn(x)[2] giving "xx". After trying out the answers provided, I realize I wasn't specific enough.

  • There will only be 2 strings (in a vector) that I am comparing.
  • The differing parts (zzz and xx) can be anywhere in the string, e.g. it could be x<-c('zzzabcdef','xxabcdef'), or they could be at the end. But they are always in the same relative position in both strings (i.e. both at the beginning, both in the middle, or both at the end).
  • zzz and xx are obviously generic names. They could be anything (digits, letters, symbols) and of different lengths (not necessarily 3 and 2).
  • Same comment applies to abc and def.

I have got some test cases

x1<-c('abcxxxttt','abczzttt')
x2<-c('abcxxxdef','abczz126gsdef')
x3<-c('xx_x123../t','z_z126gs123../t') 

fn(x1) should give "xxx" "zz"

fn(x2) should give "xxx" "zz126gs"

fn(x3) should give "xx_x" "z_z126gs"

3
  • What does "returns a length 2 vector" mean? What would you expect 'abczzzdefgggklmmmn' to give? Commented Jun 7, 2014 at 0:44
  • You need to add more about what you expect. Do you want anything with repeating letters? Only z's and x's? Commented Jun 7, 2014 at 0:45
  • It looks like you need ?intersect. Use strsplit on your "x" and then collect the elements that are not in the intersection of the two split strings. Commented Jun 7, 2014 at 0:59

3 Answers

1
x<-c('abczzzdef','abcxxdef')
fn <- function(x) unlist(regmatches(x, gregexpr("(.)\\1+", x)))
fn(x)
# [1] "zzz" "xx" 

2 Comments

(+1) That's much better than mine, which became embarrassing after seeing this so I deleted it :)
@lukeA I have updated the question, can you try again?
1

First of all, it would have been better to include all that detail in the first version of the question. No need to waste people's time coming up with solutions that won't work for you just because you didn't clearly explain what you needed. If you need to change a question that much after it's already been answered, it would probably be best to ask a new question rather than completely changing your first one.

What you are trying to do, finding the largest non-shared portion of a string, can be a pretty messy process for a computer. A somewhat standard measure of string dissimilarity is the generalized Levenshtein distance, which R implements in the adist function. It can produce a string that tells you how to transform one string into another via matches, insertions, deletions, and substitutions. If I find the longest run of matches, I'll have a pretty good idea of where to extract the unique information.
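To make the transformation string concrete, here is a small sketch that prints it for the first test case from the question. The variable names (a, b, d, trafo, runs) are only for illustration. The costs are spelled out in full here, on the assumption that unspecified costs default to 1.

```r
# Assumed inputs: the first test case from the question.
a <- "abcxxxttt"
b <- "abczzttt"

# A high substitution cost pushes the alignment toward matches,
# insertions, and deletions only.
d <- utils::adist(a, b, counts = TRUE,
                  costs = c(insertions = 1, deletions = 1, substitutions = 500))
trafo <- attr(d, "trafos")[1, 1]

# One letter per alignment step:
# M = match, I = insertion, D = deletion, S = substitution.
print(trafo)

# Locate the runs of consecutive matches and their lengths.
runs <- gregexpr("M+", trafo)[[1]]
print(as.vector(runs))
print(attr(runs, "match.length"))
```

The runs of M mark the shared text; everything between or around them is a candidate for the part that differs.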

So this method basically focuses on extracting the regions outside of the best matches. Here's the function that does the matching:

fn <- function(x) {
    # Edit transcript (M/I/D/S) for turning x[1] into x[2]; the high
    # substitution cost favors insertions and deletions
    ld <- attr(adist(x[1], x[2], counts=TRUE, 
        costs=c(substitutions=500)),"trafos")[1,1]
    # Find every run of consecutive matches
    starts <- gregexpr("M+", ld)[[1]]
    lens <- attr(starts,"match.length")
    starts <- as.vector(starts)
    ends <- starts + lens - 1
    # Index of the longest run of matches
    bm <- which.max(lens)
    if (starts[bm]==1 || ends[bm]==nchar(ld)) {
        # beg/end: mask the runs that touch either end of the transcript
        for( i in which(starts==1 | ends==nchar(ld))) {
            substr(ld, starts[i], ends[i]) <- 
                paste(rep("X", lens[i]), collapse="")
        }
    } else {
        # middle: mask only the longest run
        substr(ld, starts[bm], ends[bm]) <- 
            paste(rep("X", lens[bm]), collapse="")
    }
    tr <- strsplit(ld,"")[[1]]
    # Map the unmasked steps back to character positions in each string
    x1 <- cumsum(tr %in% c("D","M","X"))[!tr %in% c("X","I")]
    x2 <- cumsum(tr %in% c("I","M","X"))[!tr %in% c("X","D")]
    c(substr(x[1], min(x1), max(x1)), substr(x[2], min(x2), max(x2)))
}

Now we can apply it to your test data:

x1 <- c('abcxxxttt','abczzttt')
x2 <- c('abcxxxdef','abczz126gsdef')
x3 <- c('xx_x123../t','z_z126gs123../t') 

fn(x1)
# [1] "xxx" "zz" 
fn(x2)
# [1] "xxx"     "zz126gs"
fn(x3)
# [1] "xx_x"     "z_z126gs"

So we get the results you expect. Here I do little error checking. I assume there will always be some overlap and some non-overlapping regions. If that's not true, the function will likely produce an error or unexpected results.
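For comparison, here is a simpler sketch that does not use adist at all. It assumes the constraints stated in the question hold exactly: two strings, with a single differing region flanked by shared text. Under that assumption it strips the longest common prefix and the longest common suffix. The names diff_parts and common_len are hypothetical, and the input checks are only a minimal guard.

```r
# Number of leading positions at which two equal-length
# character vectors agree.
common_len <- function(u, v) {
    same <- u == v
    if (all(same)) length(same) else which(!same)[1] - 1
}

diff_parts <- function(x) {
    stopifnot(is.character(x), length(x) == 2, !anyNA(x))
    a <- strsplit(x[1], "")[[1]]
    b <- strsplit(x[2], "")[[1]]
    n <- min(length(a), length(b))
    # Length of the shared prefix
    p <- common_len(a[seq_len(n)], b[seq_len(n)])
    # Length of the shared suffix, limited so it cannot overlap the prefix
    m <- n - p
    s <- common_len(rev(a)[seq_len(m)], rev(b)[seq_len(m)])
    c(substr(x[1], p + 1, nchar(x[1]) - s),
      substr(x[2], p + 1, nchar(x[2]) - s))
}

diff_parts(c('abcxxxttt','abczzttt'))
# [1] "xxx" "zz"
diff_parts(c('abcxxxdef','abczz126gsdef'))
# [1] "xxx"     "zz126gs"
diff_parts(c('xx_x123../t','z_z126gs123../t'))
# [1] "xx_x"     "z_z126gs"
```

If the differing parts themselves begin or end with the same characters, those characters are treated as shared text, so the result may be shorter than intended.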

2 Comments

I don't even know how you figured out what the OP wants, +1 for that.
@RichardScriven I'm not sure I completely understood, but I seem to be passing all the test cases at least (for now) so that's the best I could hope for at this point.
0
> gsub("([^xz]*)([xz]*)([^xz]*)", "\\2", x)
[1] "zzz" "xx"

> getxz <- function(x, str) gsub(paste0("([^",str, ']*)([', str, ']*)([^', str, ']*)'),
                                 "\\2", x)
> getxz(x=x,"xz")
[1] "zzz" "xx" 

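In response to the new examples, here are the same tests with "_" added to the character set. Only x1 gives the full expected result; for x2 and x3 the second element is cut off at the first character outside the set: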

> getxz(x=x1,"xz_")
[1] "xxx" "zz" 
> getxz(x=x2,"xz_")
[1] "xxx" "zz" 
> getxz(x=x3,"xz_")
[1] "xx_x" "z_z" 
