
I have an RTF file that contains a list of PDF file paths, such as:

Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18

Side note: the number num after *.pdf:num can be ignored. The folder's full path can be ignored too. The entity of interest is just the file name, name.pdf.

I would like sorted output with respect to the PDF names and their number of occurrences.

The output would be in the format (name of file : number of times the file appears in the RTF), such as:

4.pdf :  4
2.pdf :  3

Note 2: Any file that is mentioned fewer than 3 times can be ignored.

  • Presumably the "Categories" can be discarded as superfluous? Also, do you want some check on file identity (size, creation date)? The reason I'm asking is that file 1.pdf in Folder_2 might be a different file from file 1.pdf in Folder_3. Thx. Commented Nov 5, 2024 at 20:56

5 Answers


One possible tool is awk. This command can do the work:

awk -F\/ '/pdf/{split($NF,a,":");b[a[1]]+=1} END {for (i in b) if(b[i]>2) print i" : "b[i]}' input_file

The script uses / as the field delimiter, splits the last field into array a on the : delimiter, and counts occurrences in array b keyed by the file name. At the end, it loops over b and prints every element whose count is greater than 2.

It is possible to simplify by splitting the line on both delimiters, / and :. In that case the script is:

awk -F'[/:]' '/pdf/{b[$(NF-1)]+=1} END {for (i in b) if(b[i]>2) print i" : "b[i]}' input_file
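One caveat: awk's `for (i in b)` loop visits keys in an unspecified order, so the lines may come out in any order. If deterministic output sorted by count is wanted, the result can simply be piped through sort. A minimal sketch against the sample data from the question (the file name input_file is just an assumed placeholder):

```shell
# Recreate the sample input from the question (file name is an assumption).
cat > input_file <<'EOF'
Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18
EOF

# Same counting logic as the one-liner above, output sorted by count, descending.
awk -F'[/:]' '/pdf/{b[$(NF-1)]+=1} END {for (i in b) if (b[i]>2) print i" : "b[i]}' input_file |
  sort -t: -k2,2rn
# 4.pdf : 4
# 2.pdf : 3
```

Here sort treats everything after the : as the second field and compares it numerically in reverse (-k2,2rn), so the most frequent file comes first.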

P.S. If you also want the sum of the values after : from the source file, you can try something like:

awk -F'[/:]' '/pdf/{b[$(NF-1)]+=1;c[$(NF-1)]+=$NF} END {for (i in b) if(b[i]>2) print i" : "b[i]", "c[i]}'  input_file

Just added c array to sum the values after :

  • This is perfect @Romeo Ninov. Very neat one-liner solution, and it skips printing any file whose counter is < 3. I thank you very much, and have a great day. Commented Nov 6, 2024 at 9:16
  • I wonder how extensible awk is, though. For instance, is it powerful enough to print something like 4.pdf : 4 , 40 followed on the next line by 2.pdf : 3 , 54, where 40 is the sum of the values that appeared after the : delimiter for 4.pdf, and 54 is the sum of the values that appeared after the : delimiter for 2.pdf? Much appreciated. Commented Nov 6, 2024 at 9:31
  • @Ronnie, yep, no problem, see my edited answer. Next time it would be good to create a new question, as that is the philosophy of this site: one question, one or more answers. Extending the scope is possible but not recommended :) Commented Nov 6, 2024 at 9:39
  • Worked like a charm. Duly noted regarding changing scope, @Romeo Ninov. Commented Nov 6, 2024 at 12:08
  • I had posted this as an answer but now notice it's too similar to your middle script to warrant being separate, so here's a similar alternative: awk -F'[/:]' -v OFS=' :\t' 'NF>2{cnt[$(NF-1)]++} END{for (f in cnt) if (cnt[f]>2) print f, cnt[f]}' file. Commented Nov 7, 2024 at 10:18

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %seen; 
             if .chars && /\.pdf/ { $_.subst(/ <?after \.pdf> \: \d+ $ /).IO.basename andthen %seen{$_}++ }; 
             END .say for %seen.sort: +*.key.match(/^ \d+ <?before \.pdf>/);'   file

OR (more simply):

~$ raku -ne 'BEGIN my %seen; 
             if .chars && s/ <?after \.pdf> \: \d+ $ // { %seen{$_.IO.basename}++ }; 
             END .say for %seen.sort: +*.key.match(/^ \d+ <?before \.pdf>/);'   file

OR (even more simply):

~$ raku -ne 'BEGIN my %seen; 
             if .chars && s/ <?after \.pdf> \: \d+ $ // { %seen{$_.IO.basename}++ }; 
             END .say for %seen.sort: +*.key.IO.extension: "";'   file

Raku is a programming language in the Perl family. Using Raku's awk-like -ne non-autoprinting command-line flags, you can build a hash of key/value pairs where the value per PDF file name is the number of times that filename was seen. Output is sorted numerically according to filename (treating its leading digits as a number). Using .say in the END block will give you paired "key => value" output:

Sample Input:

Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18

Sample Output:

1.pdf => 2
2.pdf => 3
4.pdf => 4
7.pdf => 1

If you need paired "key : value" output, change the END block of your code to:

 END put($_.key, " : ", $_.value) for %seen.sort: +*.key.IO.extension: "";'

If you need to eliminate pairs with .value < 3, further change the END block of your code to:

 END put($_.key, " : ", $_.value) if .value > 2 for %seen.sort: +*.key.IO.extension: "";'

Finally, if you prefer code written in a "chained" method/function-call style, the code below gives the same (desired) output as the code above:

~$ raku -e 'my  %seen = lines.grep( *.chars > 0 && / \.pdf /)  \ 
                        .map( *.subst(/ <?after \.pdf> \: \d+ $ / ).IO.basename).Bag;  \ 
            for %seen.sort( +*.key.IO.extension: "") { 
                put $_.key ~" : "~ $_.value if .value > 2 };'   file

https://raku.org

  • Thank you so much for such a good explanation @jubilatious1. I did try it, and it works as you said; however, Romeo Ninov's answer is way simpler, and complete, in the sense that it excludes printing all counters that are less than 3. Hence: 2.pdf => 3 and 4.pdf => 4. I thank you nonetheless. Good day~ Commented Nov 6, 2024 at 9:10
  • Will update, as it is simple to add a clause to eliminate .value < 3. FYI, I enjoy @RomeoNinov 's answers as much as you do. However I took great pains to provide results in numerical sort order, and I don't see an awk answer with (numerically) sorted output. Cheers. Commented Nov 6, 2024 at 10:40

With perl:

$ perl -lne '$c{$1}++ if m{([^/]*\.pdf):\d+$};
             END {
               print "$_ : $c{$_}" for
                 sort {$c{$b} <=> $c{$a}} grep {$c{$_} > 2} keys %c
             }' your-file
4.pdf : 4
2.pdf : 3

Extract the RTF to a plain-text file (here TEXT.txt), then do:

$ sed < TEXT.txt -nre  's,.*/([^:/]+).*,\1,p' | sort | uniq -c
      2 1.pdf
      3 2.pdf
      4 4.pdf
      1 7.pdf

$ sed < TEXT.txt -nre  's,.*/([^:/]+).*,\1,p' | sort | uniq -c | grep -vE '\s+[12] '
      3 2.pdf
      4 4.pdf
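As a comment below points out, matching the literal digits [12] is fragile: a legitimate count of 11 or 12 would also be dropped. Comparing the count column numerically, e.g. with a small awk condition, avoids that edge case. A sketch assuming the same TEXT.txt (recreated here from the question's sample data):

```shell
# Recreate the question's sample data as TEXT.txt (assumed file name).
cat > TEXT.txt <<'EOF'
Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18
EOF

# Same extraction as above, but keep only rows whose count exceeds 2,
# compared numerically instead of matching literal digits.
sed < TEXT.txt -nre 's,.*/([^:/]+).*,\1,p' | sort | uniq -c | awk '$1 > 2'
```

This keeps only the 2.pdf and 4.pdf lines, whatever the counts happen to be.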
  • Neat solution @Hannu, though it's not complete, because it's printing all files and not ignoring the ones that have fewer than 3 occurrences. I thank you nonetheless. Commented Nov 6, 2024 at 9:14
  • That is just a grep away; added above. Commented Nov 6, 2024 at 13:21
  • grep -vE '[12] ' would also filter out 11 file.pdf or 12345 r2 d2.pdf (no need for -E btw). Commented Nov 6, 2024 at 14:15
  • Well, if you need a precise explicit example for a full variation of data then please make sure it is within the Q. See the answer as an example of what you need, to be amended for data that isn't shown in the Q. Commented Nov 6, 2024 at 15:25
  • \s+ added, -E kept as this is growing, in expectation of more to come. Commented Nov 6, 2024 at 15:30

I would probably use grep here.

$ grep -Eo '[^/]+\.pdf' ./input.txt |
                               sort |
                            uniq -c | awk '$1>2 { print $2" : "$1 }'
2.pdf : 3
4.pdf : 4

Steps:

  • grep -Eo '[^/]+\.pdf' selects and prints just the PDF filenames.
  • sort | uniq -c counts the distinct filenames.
  • awk '$1>2 { print $2" : "$1 }' selects all lines with PDF names that occur at least 3 times, and formats them in the desired way.

A couple related problems:

  • What if spaces are allowed in the filenames?
    We can handle that by making the AWK expression a little uglier:
    grep -Eo '[^/]+\.pdf' ./inp.txt |
                               sort |
                            uniq -c | awk '$1>2 { n=$1; $1=""; print n" : "$0 }'
    
  • What if we want to count ./folder1/x.pdf and ./folder2/x.pdf as different?
    The best thing is probably to just edit the grep expression so it prints the full path, then report the files as unique up to paths:
    grep -Eo '^.*\.pdf' ./inp.txt |
                               sort |
                            uniq -c | awk '$1>2 { print $2" : "$1 }'
    
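For an end-to-end check, the basic pipeline above can be run against the question's sample data, recreated here with a here-document (the file name inp.txt is just a placeholder):

```shell
# Recreate the sample input from the question (assumed file name).
cat > inp.txt <<'EOF'
Category1:
./Folder1/Folder2/1.pdf:18
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category2:
./Folder3/2.pdf:18
./Folder5/4.pdf:10

Category3:
./Folder1/Folder2/1.pdf:18
./Folder5/4.pdf:10

Category4:
./Folder6/7.pdf:10
./Folder5/4.pdf:10
./Folder3/2.pdf:18
EOF

# Extract basenames, count distinct names, keep counts above 2.
grep -Eo '[^/]+\.pdf' ./inp.txt | sort | uniq -c | awk '$1>2 { print $2" : "$1 }'
# 2.pdf : 3
# 4.pdf : 4
```

Note the output is ordered alphabetically by file name, since sort runs before the counting and filtering steps.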
  • Within gawk, sort, uniq and grep can be used. Commented Nov 7, 2024 at 20:10
