Extract PDF metadata field using PHP

Question

I have a series of PDF files on my shared hosting webserver, and I'm writing a PHP script to catalogue them on screen. I've added metadata to the PDF files: Document Title, Author and Subject. The filename is composed of the Author and Title, so I can construct the catalogue text from that. However, I want to display the contents of the 'Subject' metadata field as well.

Because I'm using shared hosting, I cannot install any extra PHP extensions. They have the free version of PDFLib but this doesn't include any functions to load the PDF file or to extract metadata.

This is the script so far which just displays a list of the filenames...

function catalogue($folder){
  $files = preg_grep('/^([^.])/', scandir($folder));
  foreach($files as $file){
    echo($file.'<br/>');
  }
}

So, I've not made much progress :(

I've tried PDF_open_pdi_document() but this is not part of the installed PDFLib extension. I've tried PDF_pcos_get_string() but all I get with...

PDF_pcos_get_string($file,0,'author');

...is...

pdf_pcos_get_string(): supplied resource is not a valid pdf object resource

...and I can find literally ZERO help on the web for this function. Literally nothing!

I am running PHP 7.4 on the shared hosting.

Dr DLP · Accepted Answer · 2020-10-12 20:52:50Z

3

Metadata usually aren't compressed like the rest of the PDF, so you can use file_get_contents(), find the pattern for the subject (<</Subject) and extract it using either a regex or a simple combination of strpos()/substr().
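As a rough sketch of that approach, something like the following might work. It assumes the field is stored uncompressed as a plain literal string of the form /Subject (...), which is not guaranteed for every PDF (some store it as a hex or UTF-16 string, or inside a compressed object). The helper name pdf_info_field() is made up for illustration.

```php
<?php
// Hypothetical helper: pull one literal-string entry (e.g. Subject) out of
// raw PDF bytes. Assumes the entry is stored uncompressed as /Key (value).
function pdf_info_field(string $pdf, string $key): ?string
{
    // Match "/Key", optional whitespace, then a parenthesised string in
    // which backslash escapes such as \( and \) are allowed.
    $pattern = '/\/' . preg_quote($key, '/') . '\s*\(((?:\\\\.|[^\\\\()])*)\)/s';
    if (preg_match($pattern, $pdf, $m) !== 1) {
        return null; // not found, or not stored as a simple literal string
    }
    // Undo the escapes for "(", ")" and "\" only; other escapes are left as-is.
    return preg_replace('/\\\\([()\\\\])/', '$1', $m[1]);
}
```

It could then be called as pdf_info_field(file_get_contents($path), 'Subject'), where $path is the full path to one PDF. It returns the first match only, so a /Title that appears earlier in the file (in a bookmark, for instance) would win over the document title.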

4 Comments

MrMills Over a year ago

OK - so far simpler than I thought!

MrMills Over a year ago

Thank you - that works. However, it's a little slow to load with 40 pdf files in a directory.

Dr DLP Over a year ago

Yep, it is. Storing the results should perform better than calling file_get_contents each time.

MrMills Over a year ago

That's a good idea - thank you. I could do that. Have you any experience with pdf.js which I've commented below?
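The idea of storing the results, from the comments above, could look roughly like the sketch below. It is only one possible layout: the function name cached_metadata(), the JSON cache file and the $extract callback are all assumptions for illustration, not something from the original posts. An entry is reused until the PDF's modification time changes.

```php
<?php
// Hypothetical cache wrapper: $extract is any callable that takes a file
// path and returns an array of metadata. Results are kept in a JSON file.
function cached_metadata(string $path, string $cacheFile, callable $extract): array
{
    $cache = [];
    if (is_file($cacheFile)) {
        // Fall back to an empty cache if the file is empty or unreadable JSON.
        $cache = json_decode(file_get_contents($cacheFile), true) ?: [];
    }
    $mtime = filemtime($path);
    // Reuse the stored entry unless the PDF changed since it was cached.
    if (isset($cache[$path]) && $cache[$path]['mtime'] === $mtime) {
        return $cache[$path]['data'];
    }
    $data = $extract($path);
    $cache[$path] = ['mtime' => $mtime, 'data' => $data];
    // Note: json_encode() needs valid UTF-8 in the extracted strings.
    file_put_contents($cacheFile, json_encode($cache), LOCK_EX);
    return $data;
}
```

With this in place, the slow file_get_contents() pass over each PDF would only run the first time a file is seen or after it has been modified. For a folder of 40 files it would probably be tidier to load the cache once per request rather than once per file, but the shape of the idea is the same.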

MrMills · Accepted Answer · 2020-10-12 21:47:34Z

Thank you @drdlp. I've used file_get_contents() to load in the PDF and extract and display the metadata.

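function catalogue($folder){
  // Skip dotfiles such as "." and ".."
  $files = preg_grep('/^([^.])/', scandir($folder));
  foreach($files as $file){
    // scandir() returns bare names, so prefix the folder
    $page = file_get_contents($folder.'/'.$file);
    // Capture the text of every "/Key(value" entry
    preg_match_all('/\/[^\(]*\(([^\/\)]*)/',$page,$matches);
    // These indices depend on the order of entries in the file
    $author = $matches[1][0];
    $subject = $matches[1][4];
    $title = $matches[1][5];
    echo($title.'/'.$subject.'/'.$author.'<br>');
  }
}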

However, this is very slow for 40-odd PDF articles in a folder.

How can I speed this up?

I've begun experimenting with pdf.js, with which I can load the basic details of the files first (filename etc.) and then update them with JavaScript after the page has loaded.

However, I clearly don't know enough about JavaScript to make this work. This is what I have so far and I am very stuck. I've imported pdf.js from mozilla.github.io/pdf.js/build/pdf.js...

function pdf_metadata(file_url,id){
  var pdfjsLib = window['pdfjs-dist/build/pdf'];
  pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
  var loadingTask = pdfjsLib.getDocument(file_url);
  loadingTask.promise.then(function(pdf) {
    pdf.getMetadata().then(function(details) {
      console.log(details);
      document.getElementById(id).innerHTML=details;
    }).catch(function(err) {
      console.log('Error getting meta data');
      console.log(err);
    });
  });
}

The line console.log(details); outputs an object to the console. From there I have no idea how to extract any data at all. Therefore document.getElementById(id).innerHTML=details; displays nothing.

This is the object which is output to the console.
