I'm trying to scrape a web page for some data. I managed to send a POST request and got the right data back. The problem is that I get something like:

"Kannst du bitte noch einmal ... erzýhlen, wie du wýhrend der Safari einen Lýwen verjagt hast?"

The words should be erzählen, während and Löwen, so umlauts and ß (Ä, Ö, Ü, ß) are not displayed correctly.

Here is my code:

var querystring = require('querystring');
var iconv = require('iconv-lite');
var request = require('request');
var fs = require('fs');
var writer = fs.createWriteStream('outputBodyutf8String.html');


var form = {
    id:'2974',
    opt1:'',
    opt2:'30',
    ref:'A1',
    tid:'157',
    tid2:'',
    fnum:'2'
};

var formData = querystring.stringify(form);
var contentLength = formData.length;

request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    var utf8String = iconv.decode(body,"ISO-8859-1");
    console.log(utf8String);
    writer.write(utf8String);
});

How can I get the HTML body with the correct letters?

1 Answer

How do I find out the correct encoding of a response?

I went to the website you are attempting to scrape, and found this:

[screenshot: a character encoding declaration in the page source]

And another character encoding declaration here:

[screenshot: a second character encoding declaration in the page source]

This website defines two different character encodings! Which one do I use?

Well, this doesn't apply to you. When an HTML file is read from the local machine, the charset or content-type defined in the meta tags is used to decode it.

Since you are retrieving this document over HTTP, it will be decoded according to the response header.

Here's the response header I received after visiting the website.

[screenshot: the response headers returned by the website]

As you can see, they don't define a character set. It should appear in the Content-Type header, like this:

[screenshot: a Content-Type response header that includes a charset]
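
To check this from code rather than from the browser, something like the following could work. It is only a sketch: charsetFromContentType is a hypothetical helper of mine, not part of request, and the header values are made-up examples. In the request callback you would pass it res.headers['content-type'].

```javascript
// Hypothetical helper (not part of `request`): extract the charset parameter
// from a Content-Type header value, or return null when none is declared.
function charsetFromContentType(headerValue) {
    if (typeof headerValue !== 'string') {
        return null;
    }
    // Matches `charset=utf-8` and `charset="utf-8"`, in any letter case.
    var match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(headerValue);
    return match ? match[1].toLowerCase() : null;
}

// Made-up header values, for illustration only.
console.log(charsetFromContentType('text/html; charset=ISO-8859-1')); // iso-8859-1
console.log(charsetFromContentType('text/html'));                     // null
```

A null result corresponds to the situation described here: the server did not say which charset it uses.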

Since they don't indicate a charset in the response header, then, according to this post, the meta declaration should be used.

But wait, there were two meta charset declarations.

Since the parser reads the file top to bottom, the second declared charset should be used.

Conclusion: They use UTF-8
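
If you want to see the declarations without opening the page source by hand, a rough sketch like this lists them in document order. findMetaCharsets is a hypothetical helper, the sample markup is invented (it is not copied from the site), and a regular expression is only an approximation of a real HTML parser.

```javascript
// Hypothetical helper: list every charset named in a <meta> tag, in document order.
function findMetaCharsets(html) {
    var found = [];
    var metaTag = /<meta\b[^>]*>/gi;
    // Covers both <meta charset="..."> and content="text/html; charset=...".
    var charsetAttr = /charset\s*=\s*["']?\s*([^"'\s;\/>]+)/i;
    var tag;
    while ((tag = metaTag.exec(html)) !== null) {
        var match = charsetAttr.exec(tag[0]);
        if (match) {
            found.push(match[1].toLowerCase());
        }
    }
    return found;
}

// Invented sample markup with two declarations, for illustration only.
var sample =
    '<head>' +
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">' +
    '<meta charset="utf-8">' +
    '</head>';
console.log(findMetaCharsets(sample)); // [ 'iso-8859-1', 'utf-8' ]
```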

Also, I don't think you need the conversion. I may be wrong, but you should just be able to access the response.

request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    console.log(body);
    writer.write(body);
});
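
If the output is still garbled, one thing worth checking is whether the body was already decoded as UTF-8 before your callback sees it. As far as I know, request turns the body into a string (effectively UTF-8) unless the encoding option is set to null, in which case body is a Buffer. The sketch below uses only Node's Buffer to show what that would do to ISO-8859-1 bytes. The sample word and the assumption that the server sends ISO-8859-1 are mine and were not verified against the site.

```javascript
// Bytes as a server would send "erzählen" in ISO-8859-1: "ä" is the single byte 0xE4.
var latin1Bytes = Buffer.from('erz\u00e4hlen', 'latin1');

// Decoding those bytes as UTF-8 turns the invalid byte into U+FFFD,
// the replacement character.
var wronglyDecoded = latin1Bytes.toString('utf8');
console.log(wronglyDecoded === 'erz\uFFFDhlen'); // true

// The original byte is gone at this point, so converting the string again
// cannot restore it. Decoding the raw bytes with the right charset works:
var correctlyDecoded = latin1Bytes.toString('latin1');
console.log(correctlyDecoded === 'erz\u00e4hlen'); // true
```

If that turns out to be the cause, adding encoding: null to the request options and calling iconv.decode(body, 'ISO-8859-1') on the resulting Buffer would be the corresponding change. I have not tested this against the site.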

Edit: I don't believe the error is on their side. I believe it's on your side. Give this a try:

Remove the writer:

var writer = fs.createWriteStream('outputBodyutf8String.html');

And in the request callback, replace everything with this:

function (err, res, body) {
    console.log(body);
    fs.writeFile('outputBodyutf8String.html', body, 'utf8', function (error) {
        if (error) {
            console.log('Error occurred', error);
        }
    });
}

All the code should look like this:

var querystring = require('querystring');
var iconv = require('iconv-lite');
var request = require('request');
var fs = require('fs');

var form = {
    id:'2974',
    opt1:'',
    opt2:'30',
    ref:'A1',
    tid:'157',
    tid2:'',
    fnum:'2'
};

var formData = querystring.stringify(form);
var contentLength = formData.length;

request({
    headers: {
        'Content-Length': contentLength,
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    uri: 'xxxxxxx.php',
    body: formData,
    method: 'POST'
}, function (err, res, body) {
    console.log(body);
    fs.writeFile('outputBodyutf8String.html', body, 'utf8', function (error) {
        if (error) {
            console.log('Error occurred', error);
        }
    });
});
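
To test the file-writing step on its own, independent of the HTTP request, a small round trip like this may help. It is a sketch only: the scratch file name umlaut-check.html and the sample word are my own choices.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical scratch file, used only for this check.
var checkFile = path.join(os.tmpdir(), 'umlaut-check.html');

// Write a string that contains an umlaut, encoded as UTF-8.
fs.writeFileSync(checkFile, 'w\u00e4hrend', 'utf8');

// Read the raw bytes back: "ä" should appear as the two bytes c3 a4.
var writtenBytes = fs.readFileSync(checkFile);
console.log(writtenBytes.toString('hex')); // 77c3a46872656e64

// Reading the file as UTF-8 gives the original string again.
var roundTripped = fs.readFileSync(checkFile, 'utf8');
console.log(roundTripped === 'w\u00e4hrend'); // true

// Clean up the scratch file.
fs.unlinkSync(checkFile);
```

If this prints the expected bytes but the scraped file still looks wrong, the string was probably already damaged before it reached fs.writeFile.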
9 Comments

Thanks Lars! I've tried that and it didn't work. I've also tried other encodings like "Windows-1250" and "ISO-8859-15", but that didn't help either. I tried the same POST request in Postman and got the correct results.
I'll try it later! But that's one of the best answers ever, thanks man!
@anoumaru No problem :) Hope it helps.
Sorry to tell you, but it didn't work. It still shows "Kannst du bitte noch einmal ... erz�hlen, wie du w�hrend der Safari einen L�wen verjagt hast?"
Sorry I couldn't help. I have not had that problem before. :( But thanks for accepting my answer.