
I have a page exported from a wiki and I would like to find all the links on that page using bash. All the links on that page are in the form [wiki:<page_name>]. I have a script that does:

...
# First search for the links to the pages
search=$(grep '\[wiki:' pages/*)

# Check if our search turned up anything
if [ -n "$search" ]; then
    # Now, we want to cut out the page name and find unique listings
    uniquePages=$(echo "$search" | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d':' -f2 | cut -d' ' -f 1 | sort -u)
....

However, when presented with a grep result that has multiple [wiki: entries in it, it only pulls the last one and not any of the others. For example, if $search is:

Before starting the configuration, all the required libraries must be installed to be detected by Cmake. If you have missed this step, see the [wiki:CT/Checklist/Libraries "Libr
By pressing [t] you can switch to advanced mode screen with more details. The 5 pages are available [wiki:CT/Checklist/Cmake/advanced_mode here]. To obtain information about ea
 - '''Installation of Cantera''': If Cantera has not been correctly installed or if you do not have sourced the setup file '''~/setup_cantera''' you should receive the following message. Refer to the [wiki:CT/FormulationCantera "Cantera installation"] page to fix this problem. You can set the Cantera options to OFF if you plan to use built-in transport, thermodynamics and chemistry.

then it only returns CT/FormulationCantera and doesn't give me any of the other links. I know this is due to using cut, so I need a replacement for the $uniquePages line.
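
To illustrate, here is a minimal sketch with a hypothetical two-link line (PageA and PageB are made-up names): since each cut -f keeps exactly one field per line, the pipeline can extract at most one page name from each line of $search:

line='See [wiki:PageA here] and [wiki:PageB there]'
echo "$line" | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d':' -f2 | cut -d' ' -f 1
# prints only "PageA"; PageB is silently dropped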

Does anybody have any suggestions in bash? It can use sed or perl if needed, but I'm hoping for a one-liner to extract a list of page names if at all possible.

1 Answer

egrep -o '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//' | sort -u

Update: to remove everything after a space without needing cut:

egrep -o '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u
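
For example, feeding the sample $search text from the question through the updated pipeline (using echo here rather than pages/*, since grep prefixes each match with a filename when given multiple files, and assuming the real file lines close each link with ] — the sample above looks truncated at the terminal width):

echo "$search" | egrep -o '\[wiki:[^]]*]' | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u

prints one page name per link:

CT/Checklist/Cmake/advanced_mode
CT/Checklist/Libraries
CT/FormulationCantera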

2 Comments

Beautiful, that did it. The only change is adding a cut -d' ' -f1 before the sort, in case there is a link in the form [wiki:<page_name> <text_for_link>], which I didn't say was possible in the question, but the sample data had it in there. Thanks!
@tpg2114 you can just append another sed command instead of cut: 's/ .*//'.
