Remove duplicate but incomplete strings of text

Question

I have a hard time figuring out how to remove duplicate but incomplete strings of text. No success using perl, awk or sed.

I need to transform:

a b
a b c
a b c d
a b c d e
a b c d x
a b c d z

into

a b
a b c d e
a b c d x
a b c d z

Every incomplete pattern has to be deleted, but (1) not each final complete and unique string and (2) not strings two words in length.

All answers I could find address removal of identical duplicates.

Please update the question with some of your coding attempts and the (wrong) results generated by said code.
– markp-fuso, Commented Jul 28, 2024 at 0:02
Firstly, what have you tried and what hasn't worked? That shows what you have done so that no one gives answers and suggestions that are duplicates, and it also shows that you have made an effort, because this isn't a service where people do the work for you. Second, your question needs more clarity. What determines what is or isn't a duplicate or incomplete? Is - a b c d d a duplicate? Is - a or - b incomplete? What of - a 1 or - a %? Edit the question and include this information. Do not post it in the comments where it can get lost.
– Nasir Riley, Commented Jul 28, 2024 at 0:59
If your real strings aren't always single letters (e.g. if they can be multi-char strings, possibly including punctuation or other non-alphabetic characters) then don't provide sample input in your question that's entirely single-letter chars or you're likely to get answers that will work for your example but won't work for your real data. Always make your sample input/output minimal but realistic in terms of the types of chars in strings and single vs multi char/field/line input.
– Ed Morton, Commented Jul 28, 2024 at 18:10
If a b c d e appeared on 2 contiguous input lines, should it appear twice in the output or once?
– Ed Morton, Commented Jul 29, 2024 at 19:37
I'm a little confused by "incomplete pattern." It sounds like the rule is: if line k is a (strict) prefix of any other line, then delete line k, unless line k is exactly two words long. So a suffix of another line could stay?
– wobtax, Commented Jul 30, 2024 at 2:24

aviro · Accepted Answer · 2024-07-29 10:05:37Z

Assuming that all of the following conditions are met:

The strings in this file are sorted (meaning the "incomplete duplicates" you want to remove are each followed by a line that contains them).
You want to match only the BEGINNING of the line. For instance, in the following sequence the first line is not going to be removed (the second line contains the first line, but doesn't begin with the same sequence):
```
a b c d
e a b c d
```

Then this is very similar to: Using sed or awk, how can i delete a line whenever the next line begins with the same content followed by a slash?

Here's a possible solution:

awk 'NR==1 {prev=$0; next} index($0, prev" ") != 1 || split(prev, _) == 2 {print prev} {prev=$0} END {if (NR>0) print prev}' FILENAME

Multi line for readability:

awk '
  NR==1 {prev=$0; next}
  index($0, prev" ") != 1 || split(prev, _) == 2 {print prev}
  {prev=$0}
  END {if (NR>0) print prev}
  ' FILENAME

Copy the first line of the file to a variable called prev and skip to the next line.
Starting from the second line, check if prev" " (prev with an extra space at the end) matches the beginning (index 1) of the current line ($0). If not, print the previous line.
If the previous line consists of exactly 2 words (split(prev, _) == 2), print it anyway.
- I'm using the underscore _ in split(prev, _) just as a hint that I'm not going to use the array produced by split.
Copy the current line ($0) to prev.
When awk finishes reading the file, print the last line (prev), unless the file is empty.

Example:

$ cat testfile
a b
a b c
a b c d
a b c d e
a b c d x
a b c d z

$ awk '
  NR==1 {prev=$0; next}
  index($0, prev" ") != 1 || split(prev, _) == 2 {print prev}
  {prev=$0}
  END {if (NR>0) print prev}
  ' testfile
a b
a b c d e
a b c d x
a b c d z

@EdMorton Right, but you don't need the space after $0, only after prev, since we're trying to match it to the beginning of the current line.
– aviro, Commented Jul 29, 2024 at 9:34
@aviro actually, after thinking about it a bit, you do need the " " at the end of both strings or a b wouldn't match a b since we'd be doing index("a b","a b "), assuming exact duplicates should be treated the same way as incomplete duplicates.
– Ed Morton, Commented Jul 29, 2024 at 19:35
@EdMorton I hadn't thought about that; I guess you're right. Thanks. (Though your specific example is a line of two strings, and the OP's strict rule says that two-string lines shouldn't be deleted, so it's not really clear whether both copies should stay. But that's just nitpicking, sorry.)
– aviro, Commented Jul 30, 2024 at 6:58
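
The variant discussed in these comments, with a trailing space appended to both strings, could be sketched as follows. This is only an illustration and not a tested replacement for the answer: the function name dedup_prefixes is made up here, and the input is still assumed to be sorted as described in the answer.

```shell
# dedup_prefixes is a hypothetical name; input must already be sorted.
dedup_prefixes() {
    awk '
      NR==1 {prev=$0; next}
      # Trailing space on both sides: whole-word prefix test that also matches exact duplicates
      index($0 " ", prev " ") != 1 || split(prev, _) == 2 {print prev}
      {prev=$0}
      END {if (NR>0) print prev}
    ' "$@"
}

printf '%s\n' 'a b' 'a b c' 'a b e' 'a b e' | dedup_prefixes
# a b
# a b c
# a b e
```

Note that a duplicated two-word line would still be printed twice, because the split(prev, _) == 2 test keeps every two-word line unconditionally.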

Arnaud Valmary · Accepted Answer · 2024-07-28 16:35:18Z

#! /usr/bin/awk -f

{
    arr_stock[NR] = $0
    arr_nf[NR] = NF
    if (DEBUG) printf("STO:%d:%s:\n", NR, arr_stock[NR])
}

END {
    for (i_stock = 1; i_stock <= NR; i_stock++) {
        flag_found = 0
        motif = arr_stock[i_stock]
        if (DEBUG) printf("- MOT:%d:%d:%s:\n", i_stock, arr_nf[i_stock], motif)
        if (arr_nf[i_stock] > 2) {
            for (i_stock_2 = 1; i_stock_2 <= NR; i_stock_2++) {
                if (DEBUG) printf("  - TST:%d:%s:\n", i_stock_2, arr_stock[i_stock_2])
                # index() compares literally (motif is not treated as a regex); the " " forces a whole-word prefix
                if (i_stock != i_stock_2 && index(arr_stock[i_stock_2], motif " ") == 1) {
                    if (DEBUG) printf("    - FOUND\n")
                    flag_found = 1
                    break
                }
            }
        }
        if (flag_found == 0) {
            printf("%s\n", motif)
        }
    }
}

Ed Morton · Accepted Answer · 2024-07-30 10:39:06Z

Using any awk and sort:

$ cat tst.sh
#!/usr/bin/env bash

sort -r "${@:--}" |
awk '
    (NF == 2) || (index(prev" ",$0" ") != 1)
    { prev = $0 }
' |
sort

$ ./tst.sh file
a b
a b c d e
a b c d x
a b c d z

The " " at the end of each string in index() is necessary so that a b d would not falsely match as a substring of a b dog, assuming we only want whole-word comparisons, and a b e would match itself, assuming we want to delete exact duplicate lines as well as substring lines, e.g. given this more comprehensive sample input:

$ cat file2
a b
a b c
a b c d
a b c d e
a b c d x
a b c d z
a b d
a b dog
a b e
a b e

we get the expected output:

$ ./tst.sh file2
a b
a b c d e
a b c d x
a b c d z
a b d
a b dog
a b e
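
The effect of the trailing " " can be checked in isolation with a small sketch like the one below. The helper name prefix_index and the variable names full and part are made up for this illustration; it prints the result of index() first without and then with the trailing spaces.

```shell
# prefix_index FULL PART: hypothetical helper showing index() with and without trailing spaces
prefix_index() {
    awk -v full="$1" -v part="$2" 'BEGIN {
        # first number: raw comparison; second number: both strings padded with " "
        print index(full, part), index(full " ", part " ")
    }'
}

prefix_index 'a b dog' 'a b d'   # prints: 1 0  (the padded test no longer matches mid-word)
prefix_index 'a b e' 'a b e'     # prints: 1 1  (an exact duplicate still matches)
```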

With the above script we sort the input first so that longer strings appear before shorter strings that start with the same characters, thereby making it easy for awk to test if the current string is a substring of the previous one, then we sort again for the final output.
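
A quick way to see the ordering that the first sort produces is sketched below. Setting LC_ALL=C is an assumption made here only so that the order is reproducible (the script above uses whatever locale is in effect), and reverse_sorted is a made-up helper name.

```shell
# LC_ALL=C is an assumption, used only to make the collation order reproducible.
reverse_sorted() { LC_ALL=C sort -r; }

printf '%s\n' 'a b' 'a b c d' 'a b c' 'a b d' | reverse_sorted
# a b d
# a b c d
# a b c
# a b
```

Each line that is the beginning of a longer line ends up after that longer line, which is what lets awk compare only against the previous line.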

That approach of sorting first means it'll work no matter what order the input is in, e.g.:

$ shuf file2 > file3

$ cat file3
a b
a b c d
a b c d z
a b dog
a b c d e
a b c d x
a b c
a b e
a b d
a b e

$ ./tst.sh file3
a b
a b c d e
a b c d x
a b c d z
a b d
a b dog
a b e

If we also wanted the output order to be the same as the input order given unsorted input like above, we could apply a Decorate-Sort-Undecorate idiom to add original line numbers first then sort by and remove those at the end:

$ cat tst2.sh
#!/usr/bin/env bash

awk -v OFS='\t' '{print NR, $0}' "${@:--}" |
sort -r -k2 |
awk -v OFS='\t' '
    { nr=$1; sub(/[^\t]+\t/,"") }
    (NF == 2) || (index(prev" ",$0" ") != 1) {
        print nr, $0
    }
    { prev = $0 }
' |
sort -nk1 |
cut -f2-

$ ./tst2.sh file3
a b
a b c d z
a b dog
a b c d e
a b c d x
a b e
a b d
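
To see the Decorate-Sort-Undecorate round trip on its own, here is a minimal sketch with the filtering step left out. The helper names decorate and undecorate are made up for this illustration, and LC_ALL=C is again only an assumption for reproducibility.

```shell
# Hypothetical helpers; the stages mirror tst2.sh without the filtering awk step.
decorate()   { awk -v OFS='\t' '{print NR, $0}'; }   # prefix each line with its line number and a tab
undecorate() { sort -n -k1,1 | cut -f2-; }           # restore input order, then drop the number

printf '%s\n' 'a b dog' 'a b' 'a b c' |
    decorate |
    LC_ALL=C sort -r -k2 |    # any reordering may happen at this stage
    undecorate
# prints the three lines in their original order:
# a b dog
# a b
# a b c
```

Whatever the middle stage does to the order, the numeric sort on the first field puts the surviving lines back where they started.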
