How can I validate external references in a PDF using Apache PDFBox?

Question

I'm working on a project where I need to validate external references in PDF files using Apache PDFBox. Specifically, I want to check if a PDF contains any external references, such as links to external URLs, and ensure these references are safe and do not point to potentially malicious content.

I am using Apache PDFBox version 2.0.32. Here's what I'm trying to achieve:

Identify External References: I want to detect any external links or references in the PDF.
Validate Content: Check if these references lead to potentially harmful or unexpected content.

I have tried the following approaches:

Using the PDAnnotation class to get annotations, but I am not sure how to specifically identify external links or references from these annotations.
Reviewing the PDFBox documentation and sample code, but I couldn't find detailed guidance on handling external references.

Could someone provide guidance or sample code on how to validate external references in a PDF using Apache PDFBox? Any help or pointers to relevant documentation would be greatly appreciated.
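
To make the "Validate Content" goal concrete, here is a minimal sketch of one way to vet a URI string once it has been pulled out of the PDF (for example from `PDActionURI.getURI()` on a link annotation's action). It uses only `java.net.URI` and does not fetch the target. The class name `ExternalReferenceVetting`, the allowlisted scheme, and the `example.com` hosts are assumptions made up for illustration; a real policy would have to define its own notion of "safe".

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a URI string found in a PDF is acceptable.
final class ExternalReferenceVetting {
    // Assumption: this application accepts only https links to these hosts.
    private static final Set<String> ALLOWED_SCHEMES =
            new HashSet<>(Arrays.asList("https"));
    private static final Set<String> ALLOWED_HOSTS =
            new HashSet<>(Arrays.asList("example.com", "docs.example.com"));

    static boolean isAllowed(String rawUri) {
        if (rawUri == null || rawUri.trim().isEmpty()) {
            return false;
        }
        final URI uri;
        try {
            uri = new URI(rawUri.trim());
        } catch (URISyntaxException e) {
            return false; // unparseable references are rejected, not guessed at
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false; // relative or opaque URIs such as "javascript:..." have no host
        }
        return ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))
                && ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://example.com/report.pdf"));  // true
        System.out.println(isAllowed("http://example.com/"));             // false: scheme
        System.out.println(isAllowed("https://example.com.evil.test/"));  // false: host
        System.out.println(isAllowed("javascript:alert(1)"));             // false: no host
    }
}
```

Comparing the parsed host against an allowlist avoids the substring pitfalls of checks like `contains("http")`, which would accept any text that merely mentions a URL.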

What I Tried:

I have implemented a method to validate PDF files for potentially malicious content using Apache PDFBox. My implementation includes the following checks:

Document-Level JavaScript: I check for JavaScript in the document catalog using the checkDocumentLevelJavaScript method.
Annotations for JavaScript: I check if any annotations contain or trigger JavaScript actions using the checkAnnotationsForJavaScript method.
Form Fields for JavaScript: I verify if form fields include JavaScript using the checkFormFieldsForJavaScript method.
External References: I examine file attachments and annotations for suspicious URLs or external references using the checkExternalReferences method.

Here is a snippet of my code:

public void validatePDF(byte[] fileContent) {
    // try-with-resources closes the document even if one of the checks throws
    try (PDDocument document = PDDocument.load(fileContent)) {

        // Check document-level JavaScript
        checkDocumentLevelJavaScript(document);

        // Check annotations for JavaScript
        checkAnnotationsForJavaScript(document);

        // Check form fields for JavaScript
        checkFormFieldsForJavaScript(document);

        // Check attachments, annotations, and form fields for external references
        checkExternalReferences(document);

        System.out.println("PDF content validated successfully.");
    } catch (IOException e) {
        System.err.println("The provided PDF content is invalid.");
    }
}

private void checkExternalReferences(PDDocument document) throws IOException {
    // Check file attachments (embedded files)
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    if (catalog.getNames() != null) {
        List<PDDocumentCatalog.NamedObject> embeddedFiles = catalog.getNames().getEmbeddedFiles();
        if (embeddedFiles != null) {
            for (PDDocumentCatalog.NamedObject embeddedFile : embeddedFiles) {
                String filename = embeddedFile.getName();
                if (filename != null && !filename.isEmpty()) {
                    if (filename.contains("suspicious")) {
                        throw new SecurityException("Suspicious file attachment detected: " + filename);
                    }
                }
            }
        }
    }

    // Check annotations for URLs or external references
    for (PDPage page : document.getPages()) {
        for (PDAnnotation annotation : page.getAnnotations()) {
            if (annotation.getAction() != null) {
                String action = annotation.getAction().toString();
                if (action.contains("http") || action.contains("www")) {
                    if (action.contains("suspicious")) {
                        throw new SecurityException("Suspicious URL detected in annotation: " + action);
                    }
                }
            }
        }
    }

    // Check form fields for URLs or external references
    PDAcroForm acroForm = catalog.getAcroForm();
    if (acroForm != null) {
        for (PDField field : acroForm.getFields()) {
            if (field.getActions() != null) {
                String action = field.getActions().toString();
                if (action.contains("http") || action.contains("www")) {
                    if (action.contains("suspicious")) {
                        throw new SecurityException("Suspicious URL detected in form field: " + action);
                    }
                }
            }
        }
    }
}

private void checkAnnotationsForJavaScript(PDDocument document) throws IOException {
    for (PDPage page : document.getPages()) {
        for (PDAnnotation annotation : page.getAnnotations()) {
            String annotationContent = annotation.getContents();
            if (annotationContent != null && containsMaliciousJavaScript(annotationContent)) {
                System.err.println("Potentially malicious JavaScript found in "
                        + annotation.getClass().getSimpleName() + " annotation: " + annotationContent);
                throw new IOException("The provided PDF contains potentially malicious content.");
            }

            if (annotation instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annotation;
                PDAction action = link.getAction();

                if (action instanceof PDActionJavaScript) {
                    PDActionJavaScript jsAction = (PDActionJavaScript) action;
                    String jsContent = jsAction.getAction();
                    if (containsMaliciousJavaScript(jsContent)) {
                        System.err.println("Potentially malicious JavaScript found in "
                                + annotation.getClass().getSimpleName() + " annotation's action: " + jsContent);
                        throw new IOException("The provided PDF contains potentially malicious content.");
                    }
                }
            }
        }
    }
}

private void checkDocumentLevelJavaScript(PDDocument document) throws IOException {
    COSDictionary catalog = document.getDocumentCatalog().getCOSObject();
    COSDictionary names = (COSDictionary) catalog.getDictionaryObject(COSName.NAMES);
    if (names != null) {
        COSDictionary jsNameTree = (COSDictionary) names.getDictionaryObject(COSName.JAVA_SCRIPT);
        if (jsNameTree != null) {
            for (COSName key : jsNameTree.keySet()) {
                COSDictionary jsDict = (COSDictionary) jsNameTree.getDictionaryObject(key);
                if (jsDict != null && jsDict.containsKey(COSName.JS)) {
                    String jsContent = jsDict.getString(COSName.JS);
                    if (containsMaliciousJavaScript(jsContent)) {
                        System.err.println("Potentially malicious JavaScript found in document-level JavaScript: "
                                + jsContent);
                        throw new IOException("The provided PDF contains potentially malicious content.");
                    }
                }
            }
        }
    }
}

private void checkFormFieldsForJavaScript(PDDocument document) throws IOException {
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
    if (acroForm != null) {
        for (PDField field : acroForm.getFields()) {
            if (field.getActions() != null) {
                COSDictionary actionsDict = field.getActions().getCOSObject();
                for (COSName key : actionsDict.keySet()) {
                    if (COSName.JS.equals(key)) {
                        COSDictionary jsDict = (COSDictionary) actionsDict.getDictionaryObject(key);
                        if (jsDict != null) {
                            String jsContent = jsDict.getString(COSName.JS);
                            if (containsMaliciousJavaScript(jsContent)) {
                                System.err.println(
                                        "Potentially malicious JavaScript found in form field: " + jsContent);
                                throw new IOException("The provided PDF contains potentially malicious content.");
                            }
                        }
                    }
                }
            }
        }
    }
}

private static final Pattern MALICIOUS_JS_PATTERN = Pattern.compile(
        "(?i)\\b(alert|confirm|prompt|eval|open|window\\.open|document\\.write|document\\.writeln|location\\.href|location\\.assign|location\\.replace|iframe|script|execScript|expression|innerHTML|outerHTML|style|setAttribute|addEventListener|on[a-z]+)\\b");

private boolean containsMaliciousJavaScript(String jsContent) {
    // Guard against null: not every action or dictionary carries a script string
    if (jsContent == null) {
        return false;
    }
    return MALICIOUS_JS_PATTERN.matcher(jsContent).find();
}
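
One thing worth checking about a keyword pattern like this is how it behaves on ordinary text. The sketch below uses a shortened copy of the pattern (only five of its alternatives) to show that `on[a-z]+` and words such as `style` match plain prose, while a script that opens a URL through a name not on the list goes unflagged. The class name `KeywordPatternDemo` and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo of how a keyword-based pattern classifies sample strings.
final class KeywordPatternDemo {
    // Shortened copy of the pattern above: a handful of its alternatives only.
    static final Pattern KEYWORDS =
            Pattern.compile("(?i)\\b(eval|open|script|style|on[a-z]+)\\b");

    static boolean flags(String text) {
        return text != null && KEYWORDS.matcher(text).find();
    }

    public static void main(String[] args) {
        // Plain prose: "only", "once" and "style" all match.
        System.out.println(flags("Reviewed only once by the style committee")); // true
        // Script that opens a URL, but through a name the pattern does not list.
        System.out.println(flags("app.launchURL('http://198.51.100.7/x')"));    // false
        // A harmless calculation script.
        System.out.println(flags("this.getField('total').value = 42;"));        // false
    }
}
```

This suggests the pattern is better suited to reporting "a script is present and mentions X" than to deciding whether a script is malicious.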

I'm working on a method to validate whether a PDF contains embedded JavaScript. I expected the validation method to correctly detect the JavaScript in the PDF, but it’s not working as intended.
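
A likely reason the checks above miss scripts is structural. A PDF name tree keeps its entries in `Names` arrays and `Kids` nodes, so iterating over the keys of the tree's root dictionary does not reach the JavaScript actions. A form field's additional-actions dictionary is keyed by trigger (`K`, `F`, `V`, `C`), with `JS` sitting one level deeper inside each action. The sketch below shows the general idea of walking the whole object graph instead of a few fixed locations. It is not PDFBox code: plain `Map` and `List` objects stand in for PDF dictionaries and arrays, the class name `ScriptFinder` is hypothetical, and it only collects `JS` values that are strings (in real files the script can also be a stream).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: Map = PDF dictionary, List = PDF array.
final class ScriptFinder {
    static List<String> findScripts(Object root) {
        List<String> found = new ArrayList<>();
        // Identity-based visited set: PDF object graphs contain cycles (e.g. /Parent).
        Set<Object> visited =
                Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        walk(root, visited, found);
        return found;
    }

    private static void walk(Object node, Set<Object> visited, List<String> found) {
        if (node instanceof Map) {
            if (!visited.add(node)) {
                return; // already seen, avoid infinite recursion
            }
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
                if ("JS".equals(entry.getKey()) && entry.getValue() instanceof String) {
                    found.add((String) entry.getValue());
                } else {
                    walk(entry.getValue(), visited, found);
                }
            }
        } else if (node instanceof List) {
            if (!visited.add(node)) {
                return;
            }
            for (Object item : (List<?>) node) {
                walk(item, visited, found);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> action = new HashMap<>();
        action.put("S", "JavaScript");
        action.put("JS", "app.alert('hi')");
        List<Object> names = new ArrayList<>();
        names.add("init");   // name-tree key
        names.add(action);   // name-tree value
        Map<String, Object> tree = new HashMap<>();
        tree.put("Names", names);
        Map<String, Object> catalog = new HashMap<>();
        catalog.put("JavaScript", tree);
        catalog.put("Self", catalog); // deliberate cycle
        System.out.println(findScripts(catalog)); // [app.alert('hi')]
    }
}
```

With PDFBox, the same traversal would presumably run over `COSDictionary` and `COSArray` objects, resolving indirect references along the way.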

Could anyone provide guidance on:

1. Proper dependencies required for validating JavaScript in PDFs?
2. Example code for detecting embedded JavaScript in PDFs?

Any help or suggestions would be greatly appreciated. Thanks!

"suspicious external references" - which external references are "suspicious" and which are not? "JavaScript in annotations if it’s potentially malicious" - what makes pieces of JavaScript "potentially malicious" and others not? "harmful JavaScript" / "potentially harmful JavaScript" - what JavaScript is "harmful" and what not?
– mkl, Commented Sep 1, 2024 at 11:00
I remember a discussion on the pdfbox users mailing list years ago; use the search feature and search for "Roberto Nibali". IIRC we discussed the many possibilities where there could be JS content :-( lists.apache.org/[email protected]:lte=1M:nibali and lists.apache.org/thread/mblzfph7j5wgob74r0z22hql5s5gjzfl
– Tilman Hausherr, Commented Sep 1, 2024 at 12:56
