
I have an RTF file that contains a list of PDF file paths, such as:

Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18

Side note: the number num after *.pdf:num can be ignored. The folder's full path can be ignored too. The entity of interest is just the file name, name.pdf.

I would like sorted output with respect to the PDF names and their number of occurrences.

The output would be in the format (name of file : number of times the file appears in the RTF), such as:

4.pdf :  4
2.pdf :  3

Note 2: Any file that is mentioned fewer than 3 times can be ignored.

  • Presumably the "Categories" can be discarded as superfluous? Also, do you want some check on file identity (size, creation date)? The reason I'm asking is that file 1.pdf in Folder_2 might be a different file from file 1.pdf in Folder_3. Thx. Commented Nov 5, 2024 at 20:56

5 Answers


One possible tool is awk. This command can do the work:

awk -F\/ '/pdf/{split($NF,a,":");b[a[1]]+=1} END {for (i in b) if(b[i]>2) print i" : "b[i]}' input_file

The script uses / as the field delimiter, splits the last field into array a on the : delimiter, and counts occurrences in array b keyed by the file name. At the end, it loops over b and prints every element whose count is greater than 2.

It is possible to simplify by splitting the line on both delimiters, / and :. In that case the script is:

awk -F'[/:]' '/pdf/{b[$(NF-1)]+=1} END {for (i in b) if(b[i]>2) print i" : "b[i]}' input_file
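One caveat: awk's `for (i in b)` loop visits keys in an unspecified order, so the lines may come out in any order. If deterministic output sorted by count is wanted, the result can simply be piped through sort. A minimal sketch against the sample data from the question (the file name input_file is just an assumed placeholder):

```shell
# Recreate the sample input from the question (file name is an assumption).
cat > input_file <<'EOF'
Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18
EOF

# Same counting logic as the one-liner above, output sorted by count, descending.
awk -F'[/:]' '/pdf/{b[$(NF-1)]+=1} END {for (i in b) if (b[i]>2) print i" : "b[i]}' input_file |
  sort -t: -k2,2rn
# 4.pdf : 4
# 2.pdf : 3
```

Here sort treats everything after the : as the second field and compares it numerically in reverse (-k2,2rn), so the most frequent file comes first.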

P.S. If you also want the sum of the values after : from the source file, you can try something like:

awk -F'[/:]' '/pdf/{b[$(NF-1)]+=1;c[$(NF-1)]+=$NF} END {for (i in b) if(b[i]>2) print i" : "b[i]", "c[i]}'  input_file

Just added c array to sum the values after :

  • This is perfect @Romeo Ninov. Very neat one-liner solution, and it skips printing any file whose counter is < 3. I thank you very much, and have a great day. Commented Nov 6, 2024 at 9:16
  • I wonder how extensible awk is, though. For instance, is it powerful enough to print something like 4.pdf : 4 , 40 followed on the next line by 2.pdf : 3 , 54, where 40 is the sum of the values that appeared after the : delimiter for 4.pdf, and 54 is the sum of the values that appeared after the : delimiter for 2.pdf? Much appreciated. Commented Nov 6, 2024 at 9:31
  • @Ronnie, yep, no problem, see my edited answer. Next time it would be good to create a new question, as that is the philosophy of this site: one question, one or more answers. Extending the scope is possible but not recommended :) Commented Nov 6, 2024 at 9:39
  • Worked like a charm. Duly noted regarding changing scope, @Romeo Ninov. Commented Nov 6, 2024 at 12:08
  • I had posted this as an answer but now notice it's too similar to your middle script to warrant being separate, so here's a similar alternative: awk -F'[/:]' -v OFS=' :\t' 'NF>2{cnt[$(NF-1)]++} END{for (f in cnt) if (cnt[f]>2) print f, cnt[f]}' file. Commented Nov 7, 2024 at 10:18

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %seen; 
             if .chars && /\.pdf/ { $_.subst(/ <?after \.pdf> \: \d+ $ /).IO.basename andthen %seen{$_}++ }; 
             END .say for %seen.sort: +*.key.match(/^ \d+ <?before \.pdf>/);'   file

OR (more simply):

~$ raku -ne 'BEGIN my %seen; 
             if .chars && s/ <?after \.pdf> \: \d+ $ // { %seen{$_.IO.basename}++ }; 
             END .say for %seen.sort: +*.key.match(/^ \d+ <?before \.pdf>/);'   file

OR (even more simply):

~$ raku -ne 'BEGIN my %seen; 
             if .chars && s/ <?after \.pdf> \: \d+ $ // { %seen{$_.IO.basename}++ }; 
             END .say for %seen.sort: +*.key.IO.extension: "";'   file

Raku is a programming language in the Perl family. Using Raku's awk-like -ne non-autoprinting command-line flags, you can build a hash of key/value pairs where the value per PDF file name is the number of times that filename was seen. Output is sorted numerically according to filename (treating its leading digits as a number). Using .say in the END block will give you paired "key => value" output:

Sample Input:

Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18

Sample Output:

1.pdf => 2
2.pdf => 3
4.pdf => 4
7.pdf => 1

If you need paired "key : value" output, change the END block of your code to:

 END put($_.key, " : ", $_.value) for %seen.sort: +*.key.IO.extension: "";'

If you need to eliminate pairs with .value < 3, further change the END block of your code to:

 END put($_.key, " : ", $_.value) if .value > 2 for %seen.sort: +*.key.IO.extension: "";'

Finally, if you prefer code written in a "chained" method/function-call style, the code below gives the same (desired) output as the code above:

~$ raku -e 'my  %seen = lines.grep( *.chars > 0 && / \.pdf /)  \ 
                        .map( *.subst(/ <?after \.pdf> \: \d+ $ / ).IO.basename).Bag;  \ 
            for %seen.sort( +*.key.IO.extension: "") { 
                put $_.key ~" : "~ $_.value if .value > 2 };'   file

https://raku.org

  • Thank you so much for such a good explanation @jubilatious1. I did try it, and it works as you said; however, Romeo Ninov's answer is way simpler, and complete, in the sense that it excludes printing all counters that are less than 3. Hence: 2.pdf => 3 and 4.pdf => 4. I thank you nonetheless. Good day~ Commented Nov 6, 2024 at 9:10
  • Will update, as it is simple to add a clause to eliminate .value < 3. FYI, I enjoy @RomeoNinov 's answers as much as you do. However I took great pains to provide results in numerical sort order, and I don't see an awk answer with (numerically) sorted output. Cheers. Commented Nov 6, 2024 at 10:40

With perl:

$ perl -lne '$c{$1}++ if m{([^/]*\.pdf):\d+$};
             END {
               print "$_ : $c{$_}" for
                 sort {$c{$b} <=> $c{$a}} grep {$c{$_} > 2} keys %c
             }' your-file
4.pdf : 4
2.pdf : 3

Extract the RTF to a plain-text file (here TEXT.txt), then do:

$ sed < TEXT.txt -nre  's,.*/([^:/]+).*,\1,p' | sort | uniq -c
      2 1.pdf
      3 2.pdf
      4 4.pdf
      1 7.pdf

$ sed < TEXT.txt -nre  's,.*/([^:/]+).*,\1,p' | sort | uniq -c | grep -vE '\s+[12] '
      3 2.pdf
      4 4.pdf
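As a comment below points out, matching the literal digits [12] is fragile: a legitimate count of 11 or 12 would also be dropped. Comparing the count column numerically, e.g. with a small awk condition, avoids that edge case. A sketch assuming the same TEXT.txt (recreated here from the question's sample data):

```shell
# Recreate the question's sample data as TEXT.txt (assumed file name).
cat > TEXT.txt <<'EOF'
Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18
EOF

# Same extraction as above, but keep only rows whose count exceeds 2,
# compared numerically instead of matching literal digits.
sed < TEXT.txt -nre 's,.*/([^:/]+).*,\1,p' | sort | uniq -c | awk '$1 > 2'
```

This keeps only the 2.pdf and 4.pdf lines, whatever the counts happen to be.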
  • Neat solution @Hannu, though it's not complete, because it's printing all files and not ignoring the ones that have fewer than 3 occurrences. I thank you nonetheless. Commented Nov 6, 2024 at 9:14
  • That is just a grep away; added above. Commented Nov 6, 2024 at 13:21
  • grep -vE '[12] ' would also filter out 11 file.pdf or 12345 r2 d2.pdf (no need for -E btw). Commented Nov 6, 2024 at 14:15
  • Well, if you need a precise explicit example for a full variation of data then please make sure it is within the Q. See the answer as an example of what you need, to be amended for data that isn't shown in the Q. Commented Nov 6, 2024 at 15:25
  • \s+ added, -E kept as this is growing, in expectation of more to come. Commented Nov 6, 2024 at 15:30

I would probably use grep here.

$ grep -Eo '[^/]+\.pdf' ./input.txt |
                               sort |
                            uniq -c | awk '$1>2 { print $2" : "$1 }'
2.pdf : 3
4.pdf : 4

Steps:

  • grep -Eo '[^/]+\.pdf' selects and prints just the PDF filenames.
  • sort | uniq -c counts the distinct filenames.
  • awk '$1>2 { print $2" : "$1 }' selects all lines with PDF names that occur at least 3 times, and formats them in the desired way.

A couple related problems:

  • What if spaces are allowed in the filenames?
    We can handle that by making the AWK expression a little uglier:
    grep -Eo '[^/]+\.pdf' ./inp.txt |
                               sort |
                            uniq -c | awk '$1>2 { n=$1; $1=""; print n" : "$0 }'
    
  • What if we want to count ./folder1/x.pdf and ./folder2/x.pdf as different?
    The best thing is probably to just edit the grep expression so it prints the full path, then report the files as unique up to paths:
    grep -Eo '^.*\.pdf' ./inp.txt |
                               sort |
                            uniq -c | awk '$1>2 { print $2" : "$1 }'
    
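For an end-to-end check, the basic pipeline above can be run against the question's sample data, recreated here with a here-document (the file name inp.txt is just a placeholder):

```shell
# Recreate the sample input from the question (assumed file name).
cat > inp.txt <<'EOF'
Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18
EOF

# Extract basenames, count distinct names, keep counts above 2.
grep -Eo '[^/]+\.pdf' ./inp.txt | sort | uniq -c | awk '$1>2 { print $2" : "$1 }'
# 2.pdf : 3
# 4.pdf : 4
```

Note the output is ordered alphabetically by file name, since sort runs before the counting and filtering steps.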
  • Within gawk, sort, uniq and grep can be used. Commented Nov 7, 2024 at 20:10
