I've been trying to load a response into a goquery document, but it appears to be failing (though it throws no errors).

The response I'm trying to load comes from:

https://www.bbcgoodfood.com/search_api_ajax/search/recipes?sort=created&order=desc&page=4

and while it doesn't throw any errors, when I call fmt.Println(goquery.OuterHtml(doc.Contents())) I get the output:

<html><head></head><body></body></html>

Meanwhile, if I don't attempt to load it into a goquery document, and instead call

s, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(s))

I get:

<!doctype html>
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 no-touch" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9 no-touch" lang="en"> <![endif]-->
<!--[if gt IE 8]> <html class="no-js gt-ie-8 no-touch" lang="en"> <![endif]-->
<!--[if !IE]><!-->
<html class="no-js no-touch" lang="en">
<!--<![endif]-->

<head>
    <meta charset="utf-8">
    <title>Search | BBC Good Food</title>
    <!--[if IE]><![endif]-->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="prev" href="https://www.bbcgoodfood.com/search/recipes?page=3&amp;sort=created&amp;order=desc" />
    <link rel="next" href="https://www.bbcgoodfood.com/search/recipes?page=5&amp;sort=created&amp;order=desc" />
    <meta name="robots" content="noindex" />
    <style>
        .async-hide {
            opacity: 0 !important
        }
    ... etc

The basic logic of what I'm doing is as follows:

package main

import (
    "fmt"
    "net/http"
    "github.com/PuerkitoBio/goquery"
    "io/ioutil"
)

func main() {
    baseUrl := "https://www.bbcgoodfood.com/search_api_ajax/search/recipes?sort=created&order=desc&page="
    i := 4

    // Make a request
    req, _ := http.NewRequest(http.MethodGet, fmt.Sprintf("%s%d", baseUrl, i), nil)

    // Create a new HTTP client and execute the request
    client := &http.Client{}
    resp, _ := client.Do(req)

    // Print out response
    s, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(s))

    // Load into goquery doc
    doc, _ := goquery.NewDocumentFromReader(resp.Body)
    fmt.Println(goquery.OuterHtml(doc.Contents()))
}

The full response can be found here. Is there any particular reason why this won't load?

  • Please show your actual code. Commented Oct 25, 2019 at 14:28
  • I have removed some of the logic (as the response occasionally comes back in JSON) but I've updated it to include a (complete) minimal example. Commented Oct 25, 2019 at 16:08

2 Answers

1

Go's html parser doesn't seem to like the html you're getting - the <html> tags are all within comments, so I think it's just never getting going on the parsing.

If you prepend the document with <html> everything works fine from there. One way to do that would be with a reader-wrapper, something like the following, which writes the html tag the first time Read is called and delegates to resp.Body on subsequent calls.

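import "io"

const htmlTag = "<html>\n"

// htmlAddingReader emits htmlTag once, then delegates to source.
type htmlAddingReader struct {
    sentHtml bool
    source   io.Reader
}

func (r *htmlAddingReader) Read(b []byte) (n int, err error) {
    if !r.sentHtml {
        r.sentHtml = true
        // copy reports how many bytes were actually written. This assumes
        // len(b) >= len(htmlTag); a smaller buffer would truncate the tag.
        return copy(b, htmlTag), nil
    }
    return r.source.Read(b)
}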

To use this in your sample code, change the final section like so (and drop the earlier ioutil.ReadAll call, since it drains resp.Body before goquery can read it):

    // Load into goquery doc
    wrapped := &htmlAddingReader{}
    wrapped.source = resp.Body
    doc, _ := goquery.NewDocumentFromReader(wrapped)
    fmt.Println(goquery.OuterHtml(doc.Contents()))

0

There are two issues with the code:

(1) resp.Body is an io.ReadCloser stream.

ioutil.ReadAll(resp.Body) reads the whole stream, so there is nothing left for goquery.NewDocumentFromReader(resp.Body) to read, so it returns an empty doc.

Instead, you can use strings.NewReader(s) to create a new stream from the saved body string.

(2) doc.Contents() returns the children of the document node, and goquery.OuterHtml renders only the first item in that selection, which is just <!DOCTYPE html>. If you want the whole doc, then you probably want to use doc.Selection.

Something like this should work:

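    // Needs "io" and "strings" in the import list. io.ReadAll requires Go 1.16+;
    // on older versions use ioutil.ReadAll as in the question.

    // Read entire resp.Body into raw
    raw, _ := io.ReadAll(resp.Body)
    s := string(raw)

    // Print out response
    fmt.Println(s)

    // Create a new readable stream from the saved body with strings.NewReader(s)
    doc, _ := goquery.NewDocumentFromReader(strings.NewReader(s))

    // Use doc.Selection to get the whole doc
    fmt.Println(doc.Selection.Html())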
