
I have a bash script that iterates over a list of links, downloads an HTML page per link with curl, greps for a particular string format (CVE-####-####), removes the surrounding HTML tags (the format is consistent, so no special-case handling is necessary), searches a changelog file for the resulting string ID, and finally acts on whether the ID was found or not.

The found string ID is set as a variable. The issue is that grepping for the variable returns no results, even though I know for certain that some of the IDs are in the changelog. Here is the relevant portion of the script:

for link in $(cat links.txt); do
    curl -s "$link" | grep 'CVE-' | sed 's/<[^>]*>//g' | while read cve; do
        echo "$cve"
        grep "$cve" ./changelog.txt
    done
done

If I hardcode a known ID in the grep command, the script finds the ID and returns things as expected. I've tried many variations of grepping on this variable (e.g. exporting it and doing command expansion, cat'ing the changelog and piping to grep, setting variable directly via command expansion of the curl chain, single and double quotes surrounding variables, half a dozen other things).

Am I missing something nuanced with the outputted variable from the curl | grep | sed chain? When it is echo'd to stdout or >> to a file, things look fine (a single ID with no odd characters or carriage returns etc.).

Any hints or alternate solutions would be much appreciated. Thanks!

FYI:

OSX:$bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)

Edit:

The html file that I was curl'ing was chock full of carriage returns. Running the script with set -x was helpful because it revealed the true string being grepped: $'CVE-2011-2716\r'.

+ read -r link
+ curl -s http://localhost:8080/link1.html
+ sed -n '/CVE-/s/<[^>]*>//gp'
+ read -r cve
+ grep -q -F $'CVE-2011-2716\r' ./kernelChangelog.txt

Investigating from another angle, opening the curled file in vim showed ^M, and printf %s "$cve" | xxd showed the carriage return byte 0d appended to the grep'd variable. Relying on echo output was the wrong way to diagnose this, because the terminal does not display a trailing carriage return. To create a sample file that fails with the original script snippet above, write a simple HTML page containing a valid CVE-####-#### and add a carriage return after it (in vim insert mode, type ctrl-v ctrl-m).

This is pretty standard string sanitization that I should have figured out. The solution is to remove the carriage returns; piping to tr -d '\r' is one way of doing that. I'm not sure there is a specific duplicate on SO for this series of steps, but in any case here is my now working script:
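For anyone who wants to reproduce this without a web server, here is a minimal sketch of the failure. The file names and their contents are made up for illustration, and od -c stands in for xxd on the assumption that od is available everywhere:

```bash
#!/usr/bin/env bash
# Hypothetical sample data; file names and contents are placeholders.
tmp=$(mktemp -d)
printf 'fix for CVE-2011-2716\n' > "$tmp/changelog.txt"
# One HTML line with a CRLF line ending, like the pages in the question.
printf '<td>CVE-2011-2716</td>\r\n' > "$tmp/page.html"

# Same extraction as the original snippet, reading a file instead of curl.
cve=$(grep 'CVE-' "$tmp/page.html" | sed 's/<[^>]*>//g')

# Command substitution strips the trailing newline but not the \r.
printf '%s' "$cve" | od -An -c   # the last byte is shown as \r

# grep now looks for "CVE-2011-2716<CR>", which the changelog does not contain.
if grep -q "$cve" "$tmp/changelog.txt"; then result_raw="FOUND"; else result_raw="NOT FOUND"; fi
echo "with carriage return: $result_raw"

# After deleting the carriage return the same lookup succeeds.
clean=$(printf '%s' "$cve" | tr -d '\r')
if grep -q "$clean" "$tmp/changelog.txt"; then result_clean="FOUND"; else result_clean="NOT FOUND"; fi
echo "carriage return removed: $result_clean"

rm -rf "$tmp"
```

echo "$cve" on a terminal prints the ID and then moves the cursor back to column 0, which is why the value looked clean.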

while read -r link; do
  curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | tr -d '\r' | while read -r cve; do
    if grep -q -F "$cve" ./changelog.txt; then
      echo "FOUND: $cve";
    else
      echo "NOT FOUND: $cve";
    fi;
  done
done < links.txt
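An alternative to the extra tr process is to strip the carriage return inside the loop with a parameter expansion. This is only a sketch: the input file stands in for the curl | sed output, and the file names and the ID CVE-2099-0001 are invented for the example.

```bash
#!/usr/bin/env bash
# Hypothetical sample data standing in for the curl | sed output.
tmp=$(mktemp -d)
printf 'fix for CVE-2011-2716\n' > "$tmp/changelog.txt"
printf 'CVE-2011-2716\r\nCVE-2099-0001\r\n' > "$tmp/ids.txt"

cr=$'\r'      # a literal carriage return, kept in a variable for readability
results=""
while IFS= read -r cve; do
    # Drop one trailing carriage return, if there is one.
    cve=${cve%"$cr"}
    if grep -q -F -- "$cve" "$tmp/changelog.txt"; then
        line="FOUND: $cve"
    else
        line="NOT FOUND: $cve"
    fi
    echo "$line"
    results="$results$line;"
done < "$tmp/ids.txt"

rm -rf "$tmp"
```

Unlike tr -d '\r', this only removes a carriage return at the end of the line, which is usually where CRLF line endings leave one.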
  • Don't trust echo. Especially with an unquoted argument. printf '[%s]\n' "$cve" is better as is printf %s "$cve" | xxd. Commented Jun 1, 2015 at 20:19
  • I'd break this down when troubleshooting and start by using a curl on single link piped to grep and test on stdout to figure out what the real issue is. Commented Jun 1, 2015 at 20:28
  • You may want to also post sample data that can replicate the problem. Commented Jun 1, 2015 at 20:34
  • General script troubleshooting advice: Put set -x at the beginning of the script, so it shows each command as it's executing, with the variables expanded. Commented Jun 1, 2015 at 20:43
  • You should almost always quote your variables, in case they contain whitespace or wildcard characters. Commented Jun 1, 2015 at 20:44

2 Answers


HTML files can contain carriage returns at the ends of lines, so you need to filter those out.

curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | tr -d '\r' | while read cve; do

Notice that there's no need to use grep: the sed command can filter with a regular expression itself. (You could also remove the character inside sed, but writing \r there is cumbersome because not every sed interprets that escape, so I piped to tr instead.)
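If you would rather keep everything in one sed call, a possible sketch is to let bash supply the carriage return through $'\r' instead of relying on sed to understand the \r escape. The input line here is made-up sample data in place of the curl output:

```bash
#!/usr/bin/env bash
cr=$'\r'   # literal carriage return produced by bash, not by sed

# Hypothetical input line with a CRLF ending, in place of curl -s "$link".
cve=$(printf '<td>CVE-2011-2716</td>\r\n' |
    sed -e '/CVE-/!d' -e 's/<[^>]*>//g' -e "s/${cr}\$//")

printf '%s' "$cve" | od -An -c   # no \r at the end any more
```

The three -e expressions drop lines without CVE-, strip the tags, and delete a carriage return at the end of the line.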



It should look like this:

# First: Care about quoting your variables!

# Use read to read the file line by line
while read -r link ; do
    # No grep required. sed can do that.
    curl -s "$link" | sed -n '/CVE-/s/<[^>]*>//gp' | while read -r cve; do
        echo "$cve"
        # grep -F searches for fixed strings instead of patterns
        grep -F "$cve" ./changelog.txt
    done
done < links.txt
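To illustrate the grep -F comment, here is a small sketch with an invented file and pattern. Without -F the dot is a regular expression wildcard; with -F it has to match literally:

```bash
#!/usr/bin/env bash
# Hypothetical one-line file used only for this comparison.
tmp=$(mktemp)
printf 'release 1x2 notes\n' > "$tmp"

pattern='1.2'

# As a regular expression, "." matches any character, so "1x2" matches.
if grep -q "$pattern" "$tmp"; then regex_match=yes; else regex_match=no; fi

# As a fixed string, the pattern must appear literally, so nothing matches.
if grep -q -F "$pattern" "$tmp"; then fixed_match=yes; else fixed_match=no; fi

echo "regex: $regex_match, fixed: $fixed_match"   # prints "regex: yes, fixed: no"

rm -f "$tmp"
```

A CVE ID contains no regex metacharacters, so -F does not change the result here, but it is the safer default for IDs read from external input.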

4 Comments

Thanks for cleaning things up, but things still do not work. There has to be something wrong with that $cve variable. I'll dig deeper.
I would need to see the contents of links.txt and changelog.txt
@Barmar gave me the tip to use set -x in the script. That showed there is a carriage return \r being appended to the $cve variable. I'll give him a chance to post an actual answer that explains why and/or how to resolve. If he doesn't do that, perhaps you can edit this current answer to include that and I'll mark it accepted. In either case, thanks for the cleanup.
Please concentrate more on this comment: stackoverflow.com/questions/30582516/… Otherwise the question isn't helpful for the community.
