Using REGEX to remove duplicates when entire line is not a duplicate

Question

^(.*)(\r?\n\1)+$

replace with \1

The above is a great way to remove duplicate lines using REGEX, but it requires the entire line to be a duplicate.

However, what would I use to detect and remove dups when the line as a whole is not a dup, but only its first X characters are?

Example: Original File

12345 Dennis Yancey     University of Miami
12345 Dennis Yancey     University of Milan
12345 Dennis Yancey     University of Rome
12344 Ryan Gardner      University of Spain
12347 Smith John        University of Canada

Dups Removed

12345 Dennis Yancey     University of Miami
12344 Ryan Gardner      University of Spain
12347 Smith John        University of Canada

Which regex engine/language? Are all duplicates consecutive or can they be jumbled in with other lines, e.g. 1 1 2 1 where 1s are duplicates?
– ctwheels, Commented Nov 4, 2019 at 18:37
If duplicates are jumbled, so that other non-duplicate lines can sit between two duplicate rows, you can use ^(.{10}).*$[\s\S]*?\K^\1.*, but you'd have to run it until no more matches are found. This only works in flavors that support \K (e.g. PCRE).
– ctwheels, Commented Nov 4, 2019 at 18:41

bobble bubble · Accepted Answer · 2019-11-04 23:43:48Z

How about using a second group to check, e.g., the first 10 characters:

^((.{10}).*)(?:\r?\n\2.*)+

Where {n} specifies the number of characters from the line start that should be checked for duplicates.

The whole line is captured to $1, which is also used as the replacement.
The second group is used to check whether the following lines start with the same characters.
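
As a minimal sketch, here is how this first pattern could be applied in Python's re module, using the sample data from the question. The variable names (text, pattern, deduped) are only illustrative, and the multiline flag is assumed so that ^ matches at every line start.

```python
import re

# Sample data from the question.
text = (
    "12345 Dennis Yancey     University of Miami\n"
    "12345 Dennis Yancey     University of Milan\n"
    "12345 Dennis Yancey     University of Rome\n"
    "12344 Ryan Gardner      University of Spain\n"
    "12347 Smith John        University of Canada"
)

# Group 1 = the whole first line of a run, group 2 = its first 10 characters.
# The non-capturing group consumes every following line that starts with group 2.
pattern = re.compile(r"^((.{10}).*)(?:\r?\n\2.*)+", re.MULTILINE)

# Replace the whole run with its first line.
deduped = pattern.sub(r"\1", text)
print(deduped)
```

With this input, the three "12345 Denn" lines collapse to the Miami line and the other two lines are left alone.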

See this demo at regex101

Another idea would be to use a lookahead and replace with an empty string:

^(.{10}).*\r?\n(?=\1)

This one will just drop the current line if the captured $1 appears at the start of the next line.
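
A small Python sketch of the lookahead variant, again on the question's sample data (the names text, pattern, and result are illustrative only):

```python
import re

# Sample data from the question.
text = (
    "12345 Dennis Yancey     University of Miami\n"
    "12345 Dennis Yancey     University of Milan\n"
    "12345 Dennis Yancey     University of Rome\n"
    "12344 Ryan Gardner      University of Spain\n"
    "12347 Smith John        University of Canada"
)

# Match a line plus its line break only if the next line starts
# with the same 10 characters (checked by the lookahead, not consumed).
pattern = re.compile(r"^(.{10}).*\r?\n(?=\1)", re.MULTILINE)

# Replacing with an empty string drops the matched line.
result = pattern.sub("", text)
print(result)
```

Note that because the current line is the one dropped, the last line of each run survives (here the Rome line), whereas the first pattern keeps the first line of the run.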

Here is the demo at regex101

To also remove duplicate lines that contain fewer than 10 characters, here is a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with empty string.
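
The conditional idea can be sketched in Python too, under two assumptions: Python's re module is used instead of PCRE, and the possessive .*+ is replaced by a plain .* because possessive quantifiers are only available in re from Python 3.11. The input below is made up for illustration.

```python
import re

# Made-up input: two identical short lines, then two long lines
# that share their first 10 characters, then one unrelated line.
text = "abc\nabc\n12345 Dennis A\n12345 Dennis B\nxyz\n"

# Group 1: first 10 characters of a long line.
# Group 2: a whole line shorter than 10 characters.
# The conditional picks the lookahead that fits whichever group matched.
pattern = re.compile(
    r"^(?:(.{10})|(.{0,9}$)).*\r?\n(?(1)(?=\1)|(?=\2$))",
    re.MULTILINE,
)

result = pattern.sub("", text)
print(result)
```

As with the lookahead pattern, the last line of each consecutive run is the one that remains.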

If your regex flavor supports possessive quantifiers, use of .*+ will improve performance.

Be aware that all these patterns (and your current regex) only target consecutive duplicate lines.
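
If the duplicates are not consecutive, one alternative is to leave regex aside and track the prefixes already seen. The following is only a sketch of that different technique in Python; the function name dedupe_by_prefix and its parameter n are hypothetical, not part of any library.

```python
def dedupe_by_prefix(lines, n=10):
    """Keep the first line for each distinct n-character prefix."""
    seen = set()      # prefixes encountered so far
    kept = []
    for line in lines:
        key = line[:n]            # shorter lines use the whole line as key
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Made-up input where the duplicates are separated by another line.
sample = ["12345 Dennis A", "12344 Ryan G", "12345 Dennis B"]
unique = dedupe_by_prefix(sample)
print(unique)
```

This keeps the first occurrence of each prefix and preserves the original line order, in a single pass.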
