3

I have been working on a simple bash script that parses specific data from web pages. I used tr '\r\n' ' ' <file1.txt >file2.txt to put all the data extracted from the page (stored in file1.txt) on a single line. Now I need to match every string between <th>...</th> tags on that line and either delete it or replace it with a single space ' '. Here is some example input:

    <td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>

I have used sed and tried something like

    sed -i 's/\<th\>.*?\<\/th\>/ /g' output.txt

But it didn't work. I think the problem is the ? sign. The ? works this way in regular expressions elsewhere, but apparently not in bash.

  • It's a bad idea to parse html with shell. Commented Oct 18, 2012 at 20:08
  • You're using a unix variant, use one of the many languages available, such as perl, python, ruby, etc. to parse that. Commented Oct 18, 2012 at 20:13
  • I know that this is not the ideal solution, but solving this task is the key to finish what I am working on. So is there some form of e.g. sed command to solve this problem? Just need to select all those strings at once. Commented Oct 18, 2012 at 22:24

3 Answers

4

While I agree with sputnick and others, the answer to your immediate question would be:

    sed -r -i 's/<th>[^<]+<\/th>//g' output.txt

This works on your sample data just fine.
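
As a quick check, here is a minimal sketch that runs the substitution above on the sample line from the question, without editing any file. It assumes a sed that accepts `-E` for extended regular expressions (older GNU sed spells this `-r`); the variable names `line` and `result` are only illustrative. sed's regex dialects have no lazy `.*?` quantifier, which is why the negated class `[^<]+` is used to stop the match at the next tag.

```shell
#!/bin/sh
# Sample line taken from the question.
line='<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>'

# -E enables extended regular expressions, so + means "one or more".
# [^<]+ cannot run past the next "<", which emulates a non-greedy match.
result=$(printf '%s\n' "$line" | sed -E 's/<th>[^<]+<\/th>//g')

printf '%s\n' "$result"
```

Both <th>...</th> elements are removed and the <td> cells are left alone. Note that this pattern would not match a <th> element that itself contains another tag.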

0
 <td>
     Abaktal hm
 </td>
 <th>
     Package
 </th> 
 <td>
     flm 10x400 mg</td>
 <th> 
     Indesit
 </th>

If your input is spread over several lines like this, the following command will work:

    sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p' output.txt

It will delete the content between the <th>...</th> tags.

For more info, see: removing lines between two patterns (not inclusive) with sed.
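
The loop can be tried on the multi-line sample with the sketch below. It writes the sed script one command per line, which avoids relying on a semicolon after the `:a` label, and it assumes a sed in which `.` matches an embedded newline in the pattern space (GNU sed behaves this way). The variable names `input` and `result` are only illustrative.

```shell
#!/bin/sh
# Multi-line sample modelled on the input shown in this answer.
input=$(cat <<'EOF'
<td>
    Abaktal hm
</td>
<th>
    Package
</th>
<td>
    flm 10x400 mg</td>
<th>
    Indesit
</th>
EOF
)

# -n: print only what the script prints explicitly.
# On a <th> line: print it, then keep appending lines (N) until the
# pattern space contains </th>; s/.*\n// then drops everything up to
# the last newline, so only the </th> line is left to be printed.
result=$(printf '%s\n' "$input" | sed -n '
/<th>/{
  p
  :a
  N
  /<\/th>/!ba
  s/.*\n//
}
p
')

printf '%s\n' "$result"
```

The <th> and </th> lines themselves are kept and only the lines between them are dropped. This approach expects each tag to sit on its own line, as in the sample.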

0

Your attempt seems definitely wrong.

You can't realistically parse tag-based markup languages like HTML and XML using Bash or utilities such as grep, sed or cut. If you just want to dump/render HTML, see (links|links2|lynx|w3m) -dump, html2text, vilistextum. For parsing out pieces of data, see tidy+(xmlstarlet|xmllint|xmlgawk|xpath|xml2), or learn xslt.
