replacing repeated strings using regex in R

Question

I have a string as follows:

text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

I want to eliminate all duplicated addresses, so my expected result is:

expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

I tried (^[\w|.|:|\/]*),\1+ on regex101.com. There it removes the first repetition of the address but fails on the second. However, when I port it to R's gsub, it doesn't work as expected:

gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)

I've tried with perl = FALSE and TRUE to no avail.

What am I doing wrong?

Are these duplicates sequential, or out of order?

– user557597, Commented Jul 25, 2017 at 0:52
Always sequential

– PavoDive, Commented Jul 25, 2017 at 0:57

score 4 · Accepted Answer · 2017-07-25 01:14:25Z

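If the duplicates are always sequential, you only need to modify your regex slightly.

Remove the beginning-of-string anchor ^, so a match can start at any address and not only at the first one.
Wrap the comma and the backreference in a cluster (non-capturing) group and quantify the group: (?:,\1)+. In your pattern, ,\1+ quantifies only the backreference, so it expects a single comma followed by copies of the address with no commas between them.
Drop the pipe symbols |, because inside a character class | is just a literal character.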

([\w.:/]+)(?:,\1)+

https://regex101.com/r/FDzop9/1

 ( [\w.:/]+ )         # (1) The address
 (?:                  # Start of cluster (non-capturing group)
      , \1            # Comma followed by the text captured in group 1
 )+                   # End of cluster, repeated one or more times
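
For reference, here is a minimal sketch of how the pattern above could be applied in R, assuming the text string from the question. Backslashes are doubled for the R string literal, and perl = TRUE selects PCRE, where \w is valid inside a character class. The second pattern (strict_pattern) is not part of this answer; it is a hypothetical stricter variant that only matches whole comma-delimited items, in case one address could be a prefix or suffix of its neighbour.

```r
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

# Pattern from the answer: one address, then one or more ",<same address>"
pattern <- "([\\w.:/]+)(?:,\\1)+"
dedup <- gsub(pattern, "\\1", text, perl = TRUE)
dedup
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

# Hypothetical stricter variant (an assumption, not from the answer):
# lookarounds require the match to start and end at a comma or string boundary
strict_pattern <- "(?<![^,])([^,]+)(?:,\\1)+(?![^,])"
dedup_strict <- gsub(strict_pattern, "\\1", text, perl = TRUE)

# With partial overlaps the two patterns differ: "xa,a" has no true duplicate
loose_overlap  <- gsub(pattern, "\\1", "xa,a", perl = TRUE)         # "xa"
strict_overlap <- gsub(strict_pattern, "\\1", "xa,a", perl = TRUE)  # "xa,a"
```

For the URLs in the question both patterns give the same result; the stricter one only matters if items can partially overlap.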

Note - if you use split and unique then combine, you will lose the ordering of the items.

edited Jul 25, 2017 at 1:14

answered Jul 25, 2017 at 1:01

user557597

7 Comments

SymbolixAU Over a year ago

I can only see 2 distinct items in the text

SymbolixAU Over a year ago

Yep, all good. In R code this is gsub(pattern = "([\\w.:/]+)(?:,\\1)+", "\\1", text, perl = TRUE)

Dominic Comtois Over a year ago

Not sure in what conditions you'd lose the ordering if using split/unique/combine... @SymbolixAU's answer seems to retain the original order (tested with a few variations).

user557597 Over a year ago

@DominicComtois - Oh, I'm not sure that unique doesn't require a sorted list to test for uniqueness. Otherwise, it would be a tremendous time penalty for say 100,000 items. After sorting, I'd guess the original ordering is forever lost.

SymbolixAU Over a year ago

If anyone's interested, it's written in C.

score 3 · Answer · 2017-07-25 00:41:37Z

An alternative approach is to split the string on the commas, keep the unique elements, and paste them back together into a single string:

paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
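
As a small sketch of how this behaves when duplicates are not adjacent: unique() keeps the first occurrence of each element in its original position, so the order of first appearance is retained. The input x below is a made-up example, not data from the question, and the regex shown for comparison is the one from the other answer.

```r
# Hypothetical input: "b" and "a" each reappear later, not next to themselves
x <- "b,a,b,c,a"

# Split on literal commas, drop later repeats, and re-join
parts <- strsplit(x, ",", fixed = TRUE)[[1]]
by_unique <- paste(unique(parts), collapse = ",")
by_unique
# [1] "b,a,c"

# The sequential-duplicate regex leaves this input unchanged,
# because no item is immediately followed by a copy of itself
by_regex <- gsub("([\\w.:/]+)(?:,\\1)+", "\\1", x, perl = TRUE)
by_regex
# [1] "b,a,b,c,a"
```

So the split/unique approach also covers out-of-order duplicates, while the regex is limited to runs of adjacent ones.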

answered Jul 25, 2017 at 0:41

SymbolixAU

score 0 · Answer · 2017-07-25 02:44:27Z

text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
          "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)

If your strings are in a data frame like the one above, you can use functions from the tidyverse:

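library(tidyverse)
separate_rows(df, text, sep = ",") %>%          # one row per address
  distinct() %>%                                # drop repeated (no, text) rows
  group_by(no) %>%
  mutate(text = paste(text, collapse = ",")) %>% # re-join the addresses per group
  slice(1)                                      # keep one row per group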

The output is:

#     no                                              text
#   <int>                                             <chr>
# 1     1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2     2                          http://q.co/imag/qrs.png
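
If you would rather avoid the extra packages, the same per-row de-duplication can be sketched in base R. This is only an illustration under the assumption that the column is a character vector; dedup_addresses is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: de-duplicate the comma-separated addresses in each
# element of a character vector, keeping the order of first appearance
dedup_addresses <- function(x) {
  vapply(
    strsplit(x, ",", fixed = TRUE),
    function(parts) paste(unique(parts), collapse = ","),
    character(1)
  )
}

df <- data.frame(
  no = 1:2,
  text = c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
           "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png"),
  stringsAsFactors = FALSE  # keep text as character so strsplit() accepts it
)

df$text <- dedup_addresses(df$text)
df$text
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
# [2] "http://q.co/imag/qrs.png"
```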

answered Jul 25, 2017 at 2:44

HNSKD
