
I have a page exported from a wiki and I would like to find all the links on that page using bash. All the links on that page are in the form [wiki:<page_name>]. I have a script that does:

...
# First search for the links to the pages
search=$(grep '\[wiki:' pages/*)

# Check if our search turned up anything
if [ -n "$search" ]; then
    # Now, we want to cut out the page name and find unique listings
    uniquePages=$(echo "$search" | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d':' -f2 | cut -d' ' -f 1 | sort -u)
....

However, when presented with a grep result that has multiple [wiki: entries in it, it only pulls the last one and not any of the others. For example, if $search is:

Before starting the configuration, all the required libraries must be installed to be detected by Cmake. If you have missed this step, see the [wiki:CT/Checklist/Libraries "Libr
By pressing [t] you can switch to advanced mode screen with more details. The 5 pages are available [wiki:CT/Checklist/Cmake/advanced_mode here]. To obtain information about ea
 - '''Installation of Cantera''': If Cantera has not been correctly installed or if you do not have sourced the setup file '''~/setup_cantera''' you should receive the following message. Refer to the [wiki:CT/FormulationCantera "Cantera installation"] page to fix this problem. You can set the Cantera options to OFF if you plan to use built-in transport, thermodynamics and chemistry.

then it only returns CT/FormulationCantera and doesn't give me any of the other links. I know this is due to using cut, so I need a replacement for the $uniquePages line.
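
To illustrate, here is a minimal sketch with a hypothetical two-link line (PageA and PageB are made-up names): since each cut -f keeps exactly one field per line, the pipeline can extract at most one page name from each line of $search:

line='See [wiki:PageA here] and [wiki:PageB there]'
echo "$line" | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d':' -f2 | cut -d' ' -f 1
# prints only "PageA"; PageB is silently dropped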

Does anybody have any suggestions in bash? It can use sed or perl if needed, but I'm hoping for a one-liner to extract a list of page names if at all possible.

1 Answer

egrep -o '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//' | sort -u

Update: to remove everything after a space without needing cut:

egrep -o '\[wiki:[^]]*]' pages/* | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u
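
For example, feeding the sample $search text from the question through the updated pipeline (using echo here rather than pages/*, since grep prefixes each match with a filename when given multiple files, and assuming the real file lines close each link with ] — the sample above looks truncated at the terminal width):

echo "$search" | egrep -o '\[wiki:[^]]*]' | sed 's/\[wiki://;s/]//;s/ .*//' | sort -u

prints one page name per link:

CT/Checklist/Cmake/advanced_mode
CT/Checklist/Libraries
CT/FormulationCantera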

2 Comments

Beautiful, that did it. The only change is adding a cut -d' ' -f1 before the sort, in case there is a link in the form [wiki:<page_name> <text_for_link>], which I didn't say was possible in the question, but the sample data had it in there. Thanks!
@tpg2114 you can just append another sed command instead of cut: 's/ .*//'.
