
I have a series of PDF files on my shared hosting webserver, and I'm writing a PHP script to catalogue them on screen. I've added metadata to the PDF files: Document Title, Author and Subject. The filename is composed of the Author and Title, so I can construct the catalogue text from that. However, I also want to display the contents of the 'Subject' metadata field.

Because I'm using shared hosting, I cannot install any extra PHP extensions. They have the free version of PDFLib but this doesn't include any functions to load the PDF file or to extract metadata.

This is the script so far which just displays a list of the filenames...

function catalogue($folder){
  $files = preg_grep('/^([^.])/', scandir($folder));
  foreach($files as $file){
    echo($file.'<br/>');
  }
}

So, I've not made much progress :(

I've tried PDF_open_pdi_document() but this is not part of the installed PDFLib extension. I've tried PDF_pcos_get_string() but all I get with...

PDF_pcos_get_string($file,0,'author');

...is...

pdf_pcos_get_string(): supplied resource is not a valid pdf object resource

...and I can find literally ZERO help on the web for this function. Literally nothing!

I am running PHP 7.4 on the shared hosting.

2 Answers


The metadata isn't encrypted like the rest of the PDF, so you can read the file with file_get_contents, find the pattern for the subject (<</Subject) and extract it using either a regex or a simple combination of strpos/substr.


4 Comments

OK - so far simpler than I thought!
Thank you - that works. However, it's a little slow to load with 40 pdf files in a directory.
Yep, it is. Storing the results should be faster than calling file_get_contents each time.
That's a good idea - thank you. I could do that. Have you any experience with pdf.js which I've commented below?

Thank you @drdlp. I've used file_get_contents() to load in the PDF and extract and display the metadata.

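function catalogue($folder){
  // Skip dot-files such as '.' and '..'
  $files = preg_grep('/^([^.])/', scandir($folder));
  foreach($files as $file){
    // scandir() returns bare filenames, so prefix the folder
    $page = file_get_contents($folder.'/'.$file);
    // Capture the text inside each "/Key (value)" literal string
    preg_match_all('/\/[^\(]*\(([^\/\)]*)/',$page,$matches);
    // These indices depend on the order in which the entries appear in the file
    $author = $matches[1][0];
    $subject = $matches[1][4];
    $title = $matches[1][5];
    echo($title.'/'.$subject.'/'.$author.'<br>');
  }
}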

However, this is very slow for 40 odd PDF articles in a folder.

How can I speed this up?

I've begun experimenting with pdf.js, with which I can load all the basic details from the files first (filename etc.) and then update them with JavaScript after the page has loaded.

However, I clearly don't know enough about JavaScript to make this work. This is what I have so far and I am very stuck. I've imported pdf.js from mozilla.github.io/pdf.js/build/pdf.js...

function pdf_metadata(file_url,id){
  var pdfjsLib = window['pdfjs-dist/build/pdf'];
  pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
  var loadingTask = pdfjsLib.getDocument(file_url);
  loadingTask.promise.then(function(pdf) {
    pdf.getMetadata().then(function(details) {
      console.log(details);
      document.getElementById(id).innerHTML=details;
    }).catch(function(err) {
      console.log('Error getting meta data');
      console.log(err);
    });
  });
}

The line console.log(details); outputs an object to the console. From there I have no idea how to extract any data at all. Therefore document.getElementById(id).innerHTML=details; displays nothing.

This is the object which is output to the console.

[Screenshot of the object logged to the console]
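For reference, in pdf.js the object resolved by getMetadata() normally carries an info property holding the document information fields (Title, Author, Subject and so on), so the values can be read as details.info.Subject and the like. The sketch below assumes that shape; formatPdfInfo is a hypothetical helper name, not part of pdf.js, and the fallback strings are only placeholders.

```javascript
// Build a one-line catalogue entry from the object resolved by pdf.getMetadata().
// Assumption: the object looks like { info: { Title, Author, Subject, ... }, ... }.
// formatPdfInfo is a hypothetical helper, not a pdf.js function.
function formatPdfInfo(details) {
  // Fall back to an empty object if details or details.info is missing
  var info = (details && details.info) || {};
  var title = info.Title || '(no title)';
  var subject = info.Subject || '(no subject)';
  var author = info.Author || '(no author)';
  return title + ' / ' + subject + ' / ' + author;
}

// Possible usage inside the existing promise chain:
// pdf.getMetadata().then(function(details) {
//   document.getElementById(id).textContent = formatPdfInfo(details);
// });
```

Using textContent rather than innerHTML avoids treating a title or subject that happens to contain markup as HTML.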
