I'm working on a project where I need to validate external references in PDF files using Apache PDFBox. Specifically, I want to check if a PDF contains any external references, such as links to external URLs, and ensure these references are safe and do not point to potentially malicious content.

I am using Apache PDFBox version 2.0.32. Here's what I'm trying to achieve:

  • Identify External References: I want to detect any external links or references in the PDF.
  • Validate Content: Check if these references lead to potentially harmful or unexpected content.

I have tried the following approaches:

  • Using the PDAnnotation class to get annotations, but I am not sure how to specifically identify external links or references from these annotations.
  • Reviewing the PDFBox documentation and sample code, but I couldn’t find detailed guidance on handling external references.

Could someone provide guidance or sample code on how to validate external references in a PDF using Apache PDFBox? Any help or pointers to relevant documentation would be greatly appreciated.
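To make "safe" a bit more concrete, here is a minimal sketch of one possible policy, assuming the link target has already been extracted from the PDF as a plain string (in PDFBox 2.x, `PDActionURI.getURI()` returns such a string). The class name `LinkPolicy`, the `Verdict` enum, and the allowlist entries are hypothetical and only illustrate the idea; nothing here comes from PDFBox itself, and only the JDK is used.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: classifies a link target against a scheme and host allowlist.
final class LinkPolicy {

    enum Verdict { ALLOWED, BLOCKED_SCHEME, BLOCKED_HOST, MALFORMED }

    // Assumed policy values; replace with whatever the project actually trusts.
    private static final Set<String> ALLOWED_SCHEMES =
            new HashSet<>(Arrays.asList("https"));
    private static final Set<String> ALLOWED_HOSTS =
            new HashSet<>(Arrays.asList("example.com", "docs.example.com"));

    static Verdict classify(String rawUri) {
        if (rawUri == null || rawUri.trim().isEmpty()) {
            return Verdict.MALFORMED;
        }
        URI uri;
        try {
            uri = new URI(rawUri.trim());
        } catch (URISyntaxException e) {
            return Verdict.MALFORMED;
        }
        // Relative references ("www.example.com") have no scheme and are rejected here.
        String scheme = uri.getScheme();
        if (scheme == null || !ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))) {
            return Verdict.BLOCKED_SCHEME;
        }
        // Exact host comparison, so "example.com.evil.net" does not pass as "example.com".
        String host = uri.getHost();
        if (host == null || !ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT))) {
            return Verdict.BLOCKED_HOST;
        }
        return Verdict.ALLOWED;
    }
}
```

An allowlist is used here rather than searching for a marker substring because it rejects everything that has not been explicitly approved. Whether that is the right definition of "safe" depends on the project.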

What I Tried:

I have implemented a method to validate PDF files for potentially malicious content using Apache PDFBox. My implementation includes the following checks:

  1. Document-Level JavaScript: I check for JavaScript in the document catalog using the checkDocumentLevelJavaScript method.
  2. Annotations for JavaScript: I check if any annotations contain or trigger JavaScript actions using the checkAnnotationsForJavaScript method.
  3. Form Fields for JavaScript: I verify if form fields include JavaScript using the checkFormFieldsForJavaScript method.
  4. External References: I examine file attachments and annotations for suspicious URLs or external references using the checkExternalReferences method.

Here is a snippet of my code:

public void validatePDF(byte[] fileContent) {
    // try-with-resources closes the document even when one of the checks throws
    try (ByteArrayInputStream bais = new ByteArrayInputStream(fileContent);
         PDDocument document = PDDocument.load(bais)) {

        // Check document-level JavaScript
        checkDocumentLevelJavaScript(document);

        // Check annotations for JavaScript
        checkAnnotationsForJavaScript(document);

        // Check form fields for JavaScript
        checkFormFieldsForJavaScript(document);

        // Check external references (attachments, annotations, form fields)
        checkExternalReferences(document);

        System.out.println("PDF content validated successfully.");
    } catch (IOException e) {
        System.err.println("The provided PDF content is invalid.");
    }
}

private void checkExternalReferences(PDDocument document) throws IOException {
    // Check file attachments (embedded files)
    PDDocumentCatalog catalog = document.getDocumentCatalog();
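    PDDocumentNameDictionary names = catalog.getNames();
    if (names != null && names.getEmbeddedFiles() != null) {
        // getEmbeddedFiles() returns a name tree, not a list; getNames() yields only the
        // root node's entries (null if the tree keeps its entries under /Kids)
        Map<String, PDComplexFileSpecification> embeddedFiles = names.getEmbeddedFiles().getNames();
        if (embeddedFiles != null) {
            for (PDComplexFileSpecification fileSpec : embeddedFiles.values()) {
                String filename = fileSpec.getFilename();
                if (filename != null && !filename.isEmpty()) {
                    if (filename.contains("suspicious")) {
                        throw new SecurityException("Suspicious file attachment detected: " + filename);
                    }
                }
            }
        }
    }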

    // Check annotations for URLs or external references
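    for (PDPage page : document.getPages()) {
        for (PDAnnotation annotation : page.getAnnotations()) {
            // PDAnnotation has no getAction(); link annotations expose their action
            if (annotation instanceof PDAnnotationLink) {
                PDAction action = ((PDAnnotationLink) annotation).getAction();
                // A URI action carries the target as a string; toString() would not show it
                if (action instanceof PDActionURI) {
                    String uri = ((PDActionURI) action).getURI();
                    if (uri != null && (uri.contains("http") || uri.contains("www"))) {
                        if (uri.contains("suspicious")) {
                            throw new SecurityException("Suspicious URL detected in annotation: " + uri);
                        }
                    }
                }
            }
        }
    }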

    // Check form fields for URLs or external references
    PDAcroForm acroForm = catalog.getAcroForm();
    if (acroForm != null) {
        for (PDField field : acroForm.getFields()) {
            if (field.getActions() != null) {
                String action = field.getActions().toString();
                if (action.contains("http") || action.contains("www")) {
                    if (action.contains("suspicious")) {
                        throw new SecurityException("Suspicious URL detected in form field: " + action);
                    }
                }
            }
        }
    }
}

private void checkAnnotationsForJavaScript(PDDocument document) throws IOException {
    for (PDPage page : document.getPages()) {
        for (PDAnnotation annotation : page.getAnnotations()) {
            String annotationContent = annotation.getContents();
            if (annotationContent != null && containsMaliciousJavaScript(annotationContent)) {
                System.err.println("Potentially malicious JavaScript found in "
                        + annotation.getClass().getSimpleName() + " annotation: " + annotationContent);
                throw new IOException("The provided PDF contains potentially malicious content.");
            }

            if (annotation instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annotation;
                PDAction action = link.getAction();

                if (action instanceof PDActionJavaScript) {
                    PDActionJavaScript jsAction = (PDActionJavaScript) action;
                    String jsContent = jsAction.getAction();
                    if (containsMaliciousJavaScript(jsContent)) {
                        System.err.println("Potentially malicious JavaScript found in "
                                + annotation.getClass().getSimpleName() + " annotation's action: " + jsContent);
                        throw new IOException("The provided PDF contains potentially malicious content.");
                    }
                }
            }
        }
    }
}

private void checkDocumentLevelJavaScript(PDDocument document) throws IOException {
    COSDictionary catalog = document.getDocumentCatalog().getCOSObject();
    COSDictionary names = (COSDictionary) catalog.getDictionaryObject(COSName.NAMES);
    if (names != null) {
        COSDictionary jsNameTree = (COSDictionary) names.getDictionaryObject(COSName.JAVA_SCRIPT);
        if (jsNameTree != null) {
            for (COSName key : jsNameTree.keySet()) {
                COSDictionary jsDict = (COSDictionary) jsNameTree.getDictionaryObject(key);
                if (jsDict != null && jsDict.containsKey(COSName.JS)) {
                    String jsContent = jsDict.getString(COSName.JS);
                    if (containsMaliciousJavaScript(jsContent)) {
                        System.err.println("Potentially malicious JavaScript found in document-level JavaScript: "
                                + jsContent);
                        throw new IOException("The provided PDF contains potentially malicious content.");
                    }
                }
            }
        }
    }
}

private void checkFormFieldsForJavaScript(PDDocument document) throws IOException {
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
    if (acroForm != null) {
        for (PDField field : acroForm.getFields()) {
            if (field.getActions() != null) {
                COSDictionary actionsDict = field.getActions().getCOSObject();
                for (COSName key : actionsDict.keySet()) {
                    if (COSName.JS.equals(key)) {
                        COSDictionary jsDict = (COSDictionary) actionsDict.getDictionaryObject(key);
                        if (jsDict != null) {
                            String jsContent = jsDict.getString(COSName.JS);
                            if (containsMaliciousJavaScript(jsContent)) {
                                System.err.println(
                                        "Potentially malicious JavaScript found in form field: " + jsContent);
                                throw new IOException("The provided PDF contains potentially malicious content.");
                            }
                        }
                    }
                }
            }
        }
    }
}

private static final Pattern MALICIOUS_JS_PATTERN = Pattern.compile(
        "(?i)\\b(alert|confirm|prompt|eval|open|window\\.open|document\\.write|document\\.writeln|location\\.href|location\\.assign|location\\.replace|iframe|script|eval|execScript|expression|innerHTML|outerHTML|style|setAttribute|addEventListener|on[a-z]+)\\b");

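private boolean containsMaliciousJavaScript(String jsContent) {
    // Lookups such as COSDictionary.getString(...) can return null; treat that as "no match"
    if (jsContent == null) {
        return false;
    }
    return MALICIOUS_JS_PATTERN.matcher(jsContent).find();
}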
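One quick sanity check on the pattern above is to run it against ordinary text. In the sketch below the regex is copied verbatim from the code above; the class name `PatternProbe`, the `matchedToken` helper, and the sample strings are made up for illustration. Because of the `on[a-z]+` alternative, any word that starts with "on" matches, and `style` or `open` match as plain English words, so annotation text returned by `getContents()` may be flagged even when no script is involved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical probe: reports which token of the pattern fired, or null if none did.
final class PatternProbe {

    // Same expression as MALICIOUS_JS_PATTERN in the question.
    static final Pattern MALICIOUS_JS_PATTERN = Pattern.compile(
            "(?i)\\b(alert|confirm|prompt|eval|open|window\\.open|document\\.write|document\\.writeln|location\\.href|location\\.assign|location\\.replace|iframe|script|eval|execScript|expression|innerHTML|outerHTML|style|setAttribute|addEventListener|on[a-z]+)\\b");

    static String matchedToken(String text) {
        if (text == null) {
            return null; // null input is treated as "nothing matched"
        }
        Matcher matcher = MALICIOUS_JS_PATTERN.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "app.alert('hi')",
            "Please review only section 3",
            "Font style: bold",
            "Total: 42 EUR"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matchedToken(sample));
        }
    }
}
```

Logging the matched token next to each finding makes it easier to tell a real script hit from a false positive on plain text.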

I'm working on a method to validate whether a PDF contains embedded JavaScript. I expected the validation method to correctly detect the JavaScript in the PDF, but it’s not working as intended.

Could anyone provide guidance on:

  1. Proper dependencies required for validating JavaScript in PDFs?
  2. Example code for detecting embedded JavaScript in PDFs?

Any help or suggestions would be greatly appreciated. Thanks!

  • "suspicious external references" - which external references are "suspicious" and which are not? "JavaScript in annotations if it’s potentially malicious" - what makes pieces of JavaScript "potentially malicious" and others not? "harmful JavaScript" / "potentially harmful JavaScript" - what JavaScript is "harmful" and what not? Commented Sep 1, 2024 at 11:00
  • I remember a discussion on the pdfbox users mailing list years ago, use the search feature and search for "Roberto Nibali". IIRC we discussed the many possibilities where there could be JS content :-( lists.apache.org/[email protected]:lte=1M:nibali and lists.apache.org/thread/mblzfph7j5wgob74r0z22hql5s5gjzfl Commented Sep 1, 2024 at 12:56
