How to extract the differing parts of two strings

Question

I have

x<-c('abczzzdef','abcxxdef')

I want a function

fn(x)

that returns a length 2 vector

[1] 'zzz' 'xx'

How?

(I have tried searching for an answer but search terms like 'partial matching' give me something quite different)

Update

'length 2 vector' means length(fn(x)) is 2, and fn(x)[1] gives "zzz" while fn(x)[2] gives "xx". After trying out the answers provided, I realize I haven't been specific enough.

There will only be 2 strings (in a vector) that I am comparing.
The differing parts (zzz and xx) can be anywhere in the string, e.g. it could be x<-c('zzzabcdef','xxabcdef'), or they could be at the end. But the 2 differing parts are always at the same respective place (i.e. both at the beginning, both in the middle, or both at the end).
zzz and xx are obviously generic names. They could be different things (numbers, alphabet, symbols) and of different length (not necessarily 3 and 2).
Same comment applies to abc and def.

I have got some test cases

x1<-c('abcxxxttt','abczzttt')
x2<-c('abcxxxdef','abczz126gsdef')
x3<-c('xx_x123../t','z_z126gs123../t')

fn(x1) should give "xxx" "zz"

fn(x2) should give "xxx" "zz126gs"

fn(x3) should give "xx_x" "z_z126gs"

What does "returns a length 2 vector" mean? What would you expect 'abczzzdefgggklmmmn' to give?
– Tyler Rinker, Commented Jun 7, 2014 at 0:44
You need to add more about what you expect. Do you want anything with repeating letters? Only z's and x's?
– Nicole White, Commented Jun 7, 2014 at 0:45
It looks like you need ?intersect. strsplit your "x" and then collect the elements that are not in the intersection of the two split strings.
– alexis_laz, Commented Jun 7, 2014 at 0:59

lukeA · Accepted Answer · 2014-06-07 01:00:18Z

1

x<-c('abczzzdef','abcxxdef')
fn <- function(x) unlist(regmatches(x, gregexpr("(.)\\1+", x)))
fn(x)
# [1] "zzz" "xx"

answered Jun 7, 2014 at 1:00

lukeA


2 Comments

Rich Scriven Over a year ago

(+1) That's much better than mine, which became embarrassing after seeing this so I deleted it :)

qoheleth Over a year ago

@lukeA I have updated the question, can you try again?

MrFlick · Accepted Answer · 2014-06-07 06:14:08Z

First of all, it would have been better to include all that detail in the first version of the question. No need to waste people's time coming up with solutions that won't work for you just because you didn't clearly explain what you needed. If you need to change a question that much after it's already been answered, it would probably be best to ask a new question rather than completely changing your first one.

What you are trying to do, finding the largest non-shared portion of a string, can be a pretty messy process for a computer. A somewhat standard measure of string dissimilarity is the generalized Levenshtein distance, which R implements in the adist function. It can produce a string which tells you how to transform one string into another via matches, insertions, deletions, and substitutions. If I find the longest run of matches, I'll have a pretty good idea of where to extract the unique information.
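To make the transformation string concrete, here is a minimal sketch that prints it for the first test case and lists its runs of matches. The variable names (res, trafo, runs) are only illustrative, and the exact ordering of the D and I characters in the output may vary, so no output is shown.

```r
x <- c("abcxxxttt", "abczzttt")

# counts = TRUE attaches the edit transcript as the "trafos" attribute;
# the high substitution cost makes mismatches appear as D and I instead of S.
res <- adist(x[1], x[2], counts = TRUE, costs = c(substitutions = 500))
trafo <- attr(res, "trafos")[1, 1]
print(trafo)

# Locate the runs of matches (M) in the transcript.
runs <- gregexpr("M+", trafo)[[1]]
print(data.frame(start = as.vector(runs),
                 length = attr(runs, "match.length")))
```

Each character of the transcript is M (match), I (insertion), D (deletion) or S (substitution), so the runs of M mark the shared regions of the two strings.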

So this method basically focuses on extracting the regions outside of the best matches. Here's the function that does the matching

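fn <- function(x) {
    # Edit transcript from x[1] to x[2]; the high substitution cost forces
    # mismatches to show up as deletions (D) and insertions (I)
    ld <- attr(adist(x[1], x[2], counts=TRUE,
        costs=c(substitutions=500)),"trafos")[1,1]
    # Locate every run of matches (M) in the transcript
    starts <- gregexpr("M+", ld)[[1]]
    lens <- attr(starts,"match.length")
    starts <- as.vector(starts)
    ends <- starts + lens - 1
    bm <- which.max(lens)  # index of the longest run of matches
    if (starts[bm]==1 || ends[bm]==nchar(ld)) {
        # longest run touches the beginning/end: mask every run at either edge
        for( i in which(starts==1 | ends==nchar(ld))) {
            substr(ld, starts[i], ends[i]) <-
                paste(rep("X", lens[i]), collapse="")
        }
    } else {
        # longest run is in the middle: mask only that run
        substr(ld, starts[bm], ends[bm]) <-
            paste(rep("X", lens[bm]), collapse="")
    }
    tr <- strsplit(ld,"")[[1]]
    # Positions in x[1] and x[2] of everything outside the masked runs
    x1 <- cumsum(tr %in% c("D","M","X"))[!tr %in% c("X","I")]
    x2 <- cumsum(tr %in% c("I","M","X"))[!tr %in% c("X","D")]
    c(substr(x[1], min(x1), max(x1)), substr(x[2], min(x2), max(x2)))
}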

Now we can apply it to your test data

x1 <- c('abcxxxttt','abczzttt')
x2 <- c('abcxxxdef','abczz126gsdef')
x3 <- c('xx_x123../t','z_z126gs123../t') 

fn(x1)
# [1] "xxx" "zz" 
fn(x2)
# [1] "xxx"     "zz126gs"
fn(x3)
# [1] "xx_x"     "z_z126gs"

So we get the results you expect. Here I do little error checking. I assume there will always be some overlap and some non-overlapping regions. If that's not true, the function will likely produce an error or unexpected results.
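If you want those assumptions checked up front, something like the following sketch could be run before calling fn. The helper name check_pair and its error messages are hypothetical, not part of the function above, and it only covers the obvious cases (wrong input shape, identical strings, no shared characters at all).

```r
# Hypothetical helper: validate the input before handing it to fn().
check_pair <- function(x) {
  stopifnot(is.character(x), length(x) == 2, !anyNA(x))
  if (x[1] == x[2]) {
    stop("the two strings are identical, so there is no differing part")
  }
  # With substitutions priced out, the distance counts only insertions and
  # deletions; it equals the combined length when no character can be matched.
  d <- adist(x[1], x[2], costs = c(substitutions = 500))[1, 1]
  if (d >= sum(nchar(x))) {
    stop("the two strings share no characters, so there is no overlap")
  }
  invisible(x)
}

check_pair(c("abcxxxttt", "abczzttt"))  # passes silently
```

Passing this check does not guarantee a sensible result from fn; it only rules out the inputs that clearly violate the assumptions stated above.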

I don't even know how you figured out what the OP wants, +1 for that.
@RichardScriven I'm not sure I completely understood, but I seem to be passing all the test cases at least (for now) so that's the best I could hope for at this point.

IRTFM · Accepted Answer · 2014-06-07 03:23:30Z

0

 gsub("([^xz]*)([xz]*)([^xz]*)", "\\2", x)
[1] "zzz" "xx" 

> getxz <- function(x, str) gsub(paste0("([^",str, ']*)([', str, ']*)([^', str, ']*)'),
                                 "\\2", x)
> getxz(x=x,"xz")
[1] "zzz" "xx"

In response to the new examples, I offer these tests, which I think provide three successes:

> getxz(x=x1,"xz_")
[1] "xxx" "zz" 
> getxz(x=x2,"xz_")
[1] "xxx" "zz" 
> getxz(x=x3,"xz_")
[1] "xx_x" "z_z"

edited Jun 7, 2014 at 3:23

answered Jun 7, 2014 at 0:45

IRTFM

