5

I have a scenario where I need to automate reading the contents of a PDF. How can I retrieve the content of a PDF file in Node.js?

I am completely blocked on this. There are a few posts about pdf2json and jsonreader, but those approaches are not working for me. Any help would be appreciated.

var pdfParser = new PDFParser();
fs.readFile(pdfFilePath, function(err, pdfBuffer) {
    pdfParser.parseBuffer(pdfBuffer);
}, function(pdfBuffer){
    pdfParser.parseBuffer(pdfBuffer);
})

Error: Invalid parameter array, need either .data or .url at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:445:3)

3
  • Hi. Is there a particular issue you identified with these 2 libraries? Commented Jan 13, 2017 at 13:19
  • Hi Sebas, I added the code snippet and error to the question itself. Please have a look and let me know anything I am missing as I am new to nodejs Commented Jan 13, 2017 at 14:08
  • I am not certain the error comes from pdfParser; it may come from the fs object instead. Commented Jan 13, 2017 at 19:43
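
Following up on the last comment: `fs.readFile` takes a single callback, and the snippet in the question never checks `err`. If the read fails, `pdfBuffer` is `undefined` and the parser has nothing to work with, which is probably what the ".data or .url" message is complaining about. Below is a minimal sketch of a guard that uses only Node's built-in `fs` module. `readPdfBuffer` is a hypothetical helper name, and the `%PDF-` header check is an extra sanity check added for illustration, not something pdf2json requires.

```javascript
const fs = require("fs");

// Hypothetical helper: read a PDF into a Buffer and fail loudly
// before the buffer ever reaches a parser.
function readPdfBuffer(filePath) {
  let buffer;
  try {
    buffer = fs.readFileSync(filePath);
  } catch (err) {
    // A missing or unreadable file is reported here, not inside the parser.
    throw new Error("Could not read " + filePath + ": " + err.message);
  }
  // PDF files normally start with the ASCII marker "%PDF-".
  if (buffer.slice(0, 5).toString("latin1") !== "%PDF-") {
    throw new Error(filePath + " does not look like a PDF file");
  }
  return buffer;
}
```

The returned buffer can then be handed to `pdfParser.parseBuffer(buffer)` as in the question.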

3 Answers

1

I found the answer and it's working perfectly. Install pdf2json by running npm install pdf2json. The fs module is built into Node.js, so it does not need to be installed separately.

var fs = require('fs');
var os = require('os');
var path = require('path');
var PDFParser = require('pdf2json');

// path.join picks the right separator for the current platform
var pdfFilePath = path.join(os.homedir(), 'Downloads', 'filename.pdf');

if (fs.existsSync(pdfFilePath)) {
  // Read the content of the PDF from the download path.
  // `browser` and `textToVerify` are not defined in this snippet;
  // they are expected to come from the calling test code.
  // The second argument (1) is needed for getRawTextContent() to return text.
  var pdfParser = new PDFParser(browser, 1);
  pdfParser.on("pdfParser_dataError", function (errData) {
    console.error(errData.parserError);
  });
  pdfParser.on("pdfParser_dataReady", function (pdfData) {
    // console.log('here is the content: ' + pdfParser.getRawTextContent());
    browser.assert.ok(pdfParser.getRawTextContent().indexOf(textToVerify) > -1);
  });

  pdfParser.loadPDF(pdfFilePath);
} else {
  console.log('Oops, file not present in the Downloads folder');
  // Fail the test if the file is not found at the expected path
  browser.assert.ok(fs.existsSync(pdfFilePath));
}
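
One caveat with the `indexOf` check above: text extracted from a PDF can contain line breaks or extra spaces that are not visible in the document, so an exact substring match may fail even when the text is there. Here is a small sketch of a whitespace-tolerant comparison. `normalizeWhitespace` and `containsText` are hypothetical helper names, not part of pdf2json.

```javascript
// Collapse every run of whitespace (spaces, tabs, line breaks) into a
// single space and trim both ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Return true if `expected` occurs in `rawText`, ignoring whitespace differences.
function containsText(rawText, expected) {
  return normalizeWhitespace(rawText).includes(normalizeWhitespace(expected));
}

console.log(containsText("Invoice\r\nnumber:  42", "Invoice number: 42")); // true
```

With this in place, the assertion could be written as `browser.assert.ok(containsText(pdfParser.getRawTextContent(), textToVerify))`.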

3 Comments

Can you please tell me what browser is here? I am using the same code, but I am getting this error: ReferenceError: browser is not defined. Thanks.
You need to explain what the browser variable is.
The documentation at npmjs.com/package/pdf2json is a pretty good starting point for using pdf2json
1
const fs = require("fs");
const PdfReader = require("pdfreader").PdfReader;

fs.readFile("E://file streaming in node js//demo//read.pdf", (err, pdfBuffer) => {
  // Stop here if the file could not be read
  if (err) return console.error(err);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error(err); // parsing failed
    else if (!item) console.log("end of file"); // no more items
    else if (item.text) console.log(item.text); // one text fragment
  });
});
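
pdfreader emits text one fragment at a time, so the callback above prints fragments rather than whole lines. Each text item also carries `x` and `y` coordinates, which makes it possible to rebuild lines by grouping items that share the same `y`. Here is a rough sketch under that assumption. `groupIntoLines` is a hypothetical helper that works on a plain array of items collected in the callback. It only merges items whose `y` values match exactly, and it should be called once per page because `y` starts over on each page.

```javascript
// Group text items ({ x, y, text }) into lines: same y means same line.
// Lines are ordered top to bottom, fragments left to right.
function groupIntoLines(items) {
  const rows = new Map(); // y coordinate -> fragments on that line
  for (const item of items) {
    if (!item.text) continue; // skip markers that carry no text
    if (!rows.has(item.y)) rows.set(item.y, []);
    rows.get(item.y).push(item);
  }
  return [...rows.keys()]
    .sort((a, b) => a - b)
    .map((y) =>
      rows
        .get(y)
        .sort((a, b) => a.x - b.x)
        .map((item) => item.text)
        .join(" ")
    );
}
```

To use it, push each `item` into an array inside the `parseBuffer` callback and call `groupIntoLines` on that array when the callback receives an empty item.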


1

pdfreader and pdf2json didn't work properly: weirdly enough, numbers were parsed incorrectly and line breaks were added sporadically around non-alphanumeric characters.

I ended up using pdf-parse, which works like a charm:

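const fs = require("fs");
const os = require("os");
const path = require("path");
const pdf = require("pdf-parse");

// Node does not expand "~", so build the path from the home directory
const filePath = path.join(os.homedir(), "Documents", "file.pdf");

// pdf-parse (1.x API) expects the file contents as a Buffer, not a path
pdf(fs.readFileSync(filePath))
  .then((data) => {
    console.log(data.text); // the extracted text
  })
  .catch((error) => {
    console.error(error);
  });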
