I have a string as follows:

text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

I want to eliminate all duplicated addresses, so my expected result is:

expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

I tried (^[\w|.|:|\/]*),\1+ on regex101.com, and it removes the first repetition of the string but fails at the second. However, when I port it to R's gsub, it doesn't work as expected:

gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)

I've tried with perl = FALSE and TRUE to no avail.

What am I doing wrong?

  • Are these duplicates sequential, or out of order? Commented Jul 25, 2017 at 0:52
  • Always sequential Commented Jul 25, 2017 at 0:57

3 Answers

If they are sequential, you just need to modify your regex slightly.

Take out your beginning-of-string anchor ^.
Add a cluster (non-capturing) group around the comma and backreference, then quantify it: (?:,\1)+.
And drop the pipe symbol |, since inside a character class it is just a literal.

([\w.:/]+)(?:,\1)+

https://regex101.com/r/FDzop9/1

 ( [\w.:/]+ )         # (1), The address
 (?:                  # Cluster
      , \1                 # Comma followed by what was found in group 1
 )+                   # Cluster end, 1 to many times
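
To make this concrete in R, here is a minimal sketch of the pattern above passed to gsub with perl = TRUE (PCRE), using the question's own string. The second pattern, `strict`, is my own assumption and not part of this answer: it anchors each item to a comma or a string edge so that a short item cannot match the tail of a longer neighbour.

```r
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

# Pattern from this answer; perl = TRUE selects the PCRE engine.
loose <- "([\\w.:/]+)(?:,\\1)+"
result <- gsub(loose, "\\1", text, perl = TRUE)
result
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

# Assumed stricter variant (not from the answer): each item must start
# after a comma or at the start, and end before a comma or at the end.
strict <- "(?<![^,])([^,]+)(?:,\\1(?![^,]))+"
strict_result <- gsub(strict, "\\1", text, perl = TRUE)

# The two differ when one item is a suffix of the previous one.
loose_edge  <- gsub(loose,  "\\1", "xa,a", perl = TRUE)  # "xa"
strict_edge <- gsub(strict, "\\1", "xa,a", perl = TRUE)  # "xa,a"
```

For full URLs like the ones in the question, both patterns should give the same result; the stricter one only matters if items can overlap as substrings.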

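Note - split, unique, then combine is an alternative. I suspected it would lose the ordering of the items, but R's unique() returns elements in order of first appearance, so the ordering is kept.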


Comments

I can only see 2 distinct items in the text
yep all good. And this in R code is gsub(pattern = "([\\w.:/]+)(?:,\\1)+", "\\1", text, perl = TRUE)
Not sure in what conditions you'd lose the ordering if using split/unique/combine... @SymbolixAU's answer seems to retain the original order (tested with a few variations).
@DominicComtois - Oh, I'm not sure that unique doesn't require a sorted list to test for uniqueness. Otherwise, it would be a tremendous time penalty for say 100,000 items. After sorting, I'd guess the original ordering is forever lost.
If anyone's interested it's written in c

An alternative approach is to split the string on the comma, keep only the unique elements, then re-combine them into a single string:

paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
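
As a quick check on ordering, here is a small sketch with a made-up input. `dedupe_csv` is a hypothetical helper name, not something from the answers. unique() keeps the first occurrence of each element in its original position, so the order survives; unlike the regex approach, it also drops duplicates that are not adjacent.

```r
# Hypothetical helper: split on commas, drop repeats, re-join.
dedupe_csv <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]  # fixed = TRUE: literal comma
  paste(unique(parts), collapse = ",")
}

dedupe_csv("b,a,b,c,a")
# [1] "b,a,c"
```

If duplicates must only be collapsed when they are consecutive, this is not equivalent to the regex, so pick whichever behaviour the data calls for.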

text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
          "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)

You can use functions from tidyverse if your strings are in a dataframe:

library(tidyverse)
separate_rows(df, text, sep = ",") %>% 
  distinct %>% 
  group_by(no) %>% 
  mutate(text = paste(text, collapse = ",")) %>% 
  slice(1)

The output is:

#     no                                              text
#   <int>                                             <chr>
# 1     1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2     2                          http://q.co/imag/qrs.png
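
A possible variation, sketched under the assumption that dplyr and tidyr are installed: summarise() collapses each group to one row directly, so the mutate() plus slice(1) step is not needed. The data frame is rebuilt here with stringsAsFactors = FALSE (my addition) so that text is a character column on older R versions as well.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  no = 1:2,
  text = c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
           "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png"),
  stringsAsFactors = FALSE
)

result <- df %>%
  separate_rows(text, sep = ",") %>%   # one row per address
  distinct() %>%                       # drop repeated (no, text) pairs
  group_by(no) %>%
  summarise(text = paste(text, collapse = ","))  # one row per original string
```

Because distinct() works on the (no, text) pair, an address that appears in two different rows of df is kept once in each of them.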
