I'm working on a project where I need to validate external references in PDF files using Apache PDFBox. Specifically, I want to check if a PDF contains any external references, such as links to external URLs, and ensure these references are safe and do not point to potentially malicious content.
I am using Apache PDFBox version 2.0.32. Here's what I'm trying to achieve:
- Identify External References: I want to detect any external links or references in the PDF.
- Validate Content: Check if these references lead to potentially harmful or unexpected content.
I have tried the following approaches:
- Using the PDAnnotation class to get annotations, but I am not sure how to specifically identify external links or references from these annotations.
- Reviewing the PDFBox documentation and sample code, but I couldn't find detailed guidance on handling external references.
Could someone provide guidance or sample code on how to validate external references in a PDF using Apache PDFBox? Any help or pointers to relevant documentation would be greatly appreciated.
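For the "Validate Content" goal, one possible direction is to parse each extracted reference with java.net.URI and compare its scheme and host against explicit lists instead of searching for substrings. The following is only a minimal sketch under stated assumptions: the class name UriSafetyCheck, the allowed schemes, and the blocked host are hypothetical placeholders, and the input is assumed to be a URI string that has already been read from the PDF.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a URI string taken from a PDF looks acceptable. */
class UriSafetyCheck {

    // Assumption: only plain web links are expected; file:, javascript:, etc. are rejected.
    private static final Set<String> ALLOWED_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https"));

    // Assumption: placeholder blocklist; a real one would come from configuration.
    private static final Set<String> BLOCKED_HOSTS =
            new HashSet<>(Arrays.asList("malicious.example"));

    static boolean isAllowed(String rawUri) {
        if (rawUri == null || rawUri.trim().isEmpty()) {
            return false;
        }
        final URI uri;
        try {
            uri = new URI(rawUri.trim());
        } catch (URISyntaxException e) {
            return false; // unparseable references are treated as unsafe
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false; // relative or opaque URIs carry no host to check
        }
        return ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))
                && !BLOCKED_HOSTS.contains(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://pdfbox.apache.org/")); // true
        System.out.println(isAllowed("javascript:alert(1)"));        // false
        System.out.println(isAllowed("http://malicious.example/x")); // false
    }
}
```

Comparing the parsed scheme and host avoids both missing a link that lacks the substring "http" and flagging harmless text that happens to contain it.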
What I Tried:
I have implemented a method to validate PDF files for potentially malicious content using Apache PDFBox. My implementation includes the following checks:
- Document-Level JavaScript: I check for JavaScript in the document catalog using the checkDocumentLevelJavaScript method.
- Annotations for JavaScript: I check if any annotations contain or trigger JavaScript actions using the checkAnnotationsForJavaScript method.
- Form Fields for JavaScript: I verify if form fields include JavaScript using the checkFormFieldsForJavaScript method.
- External References: I examine file attachments and annotations for suspicious URLs or external references using the checkExternalReferences method.
Here is a snippet of my code:
public void validatePDF(byte[] fileContent) {
    // try-with-resources closes the document even when one of the checks throws
    try (PDDocument document = PDDocument.load(fileContent)) {
        // Check document-level JavaScript
        checkDocumentLevelJavaScript(document);
        // Check annotations for JavaScript
        checkAnnotationsForJavaScript(document);
        // Check form fields for JavaScript
        checkFormFieldsForJavaScript(document);
        // Check external references (attachments, annotation and form field actions)
        checkExternalReferences(document);
        System.out.println("PDF content validated successfully.");
    } catch (IOException e) {
        System.err.println("The provided PDF content is invalid.");
    }
}
private void checkExternalReferences(PDDocument document) throws IOException {
    // Check file attachments (embedded files)
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    if (catalog.getNames() != null) {
        PDEmbeddedFilesNameTreeNode embeddedFiles = catalog.getNames().getEmbeddedFiles();
        // getNames() returns only the entries stored directly on this node (or null);
        // entries held in kid nodes are not visited here.
        Map<String, PDComplexFileSpecification> files =
                embeddedFiles != null ? embeddedFiles.getNames() : null;
        if (files != null) {
            for (String filename : files.keySet()) {
                if (filename != null && !filename.isEmpty()) {
                    if (filename.contains("suspicious")) {
                        throw new SecurityException("Suspicious file attachment detected: " + filename);
                    }
                }
            }
        }
    }
    // Check link annotations for URLs or external references
    for (PDPage page : document.getPages()) {
        for (PDAnnotation annotation : page.getAnnotations()) {
            // PDAnnotation itself has no getAction(); link annotations do
            if (annotation instanceof PDAnnotationLink) {
                checkActionUri(((PDAnnotationLink) annotation).getAction(), "annotation");
            }
        }
    }
    // Check form fields for URLs or external references
    PDAcroForm acroForm = catalog.getAcroForm();
    if (acroForm != null) {
        for (PDField field : acroForm.getFields()) {
            PDFormFieldAdditionalActions actions = field.getActions();
            if (actions != null) {
                // Additional actions: keystroke (K), format (F), validate (V), calculate (C)
                for (PDAction action : Arrays.asList(
                        actions.getK(), actions.getF(), actions.getV(), actions.getC())) {
                    checkActionUri(action, "form field");
                }
            }
        }
    }
}

// Reads the target of a URI action; toString() on the action object would not include it
private void checkActionUri(PDAction action, String location) {
    if (action instanceof PDActionURI) {
        String uri = ((PDActionURI) action).getURI();
        if (uri != null && (uri.contains("http") || uri.contains("www"))) {
            if (uri.contains("suspicious")) {
                throw new SecurityException("Suspicious URL detected in " + location + ": " + uri);
            }
        }
    }
}
private void checkAnnotationsForJavaScript(PDDocument document) throws IOException {
    for (PDPage page : document.getPages()) {
        for (PDAnnotation annotation : page.getAnnotations()) {
            String annotationContent = annotation.getContents();
            if (annotationContent != null && containsMaliciousJavaScript(annotationContent)) {
                System.err.println("Potentially malicious JavaScript found in "
                        + annotation.getClass().getSimpleName() + " annotation: " + annotationContent);
                throw new IOException("The provided PDF contains potentially malicious content.");
            }
            if (annotation instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annotation;
                PDAction action = link.getAction();
                if (action instanceof PDActionJavaScript) {
                    PDActionJavaScript jsAction = (PDActionJavaScript) action;
                    String jsContent = jsAction.getAction();
                    if (containsMaliciousJavaScript(jsContent)) {
                        System.err.println("Potentially malicious JavaScript found in "
                                + annotation.getClass().getSimpleName() + " annotation's action: " + jsContent);
                        throw new IOException("The provided PDF contains potentially malicious content.");
                    }
                }
            }
        }
    }
}
private void checkDocumentLevelJavaScript(PDDocument document) throws IOException {
    COSDictionary catalog = document.getDocumentCatalog().getCOSObject();
    COSDictionary names = (COSDictionary) catalog.getDictionaryObject(COSName.NAMES);
    if (names != null) {
        COSDictionary jsNameTree = (COSDictionary) names.getDictionaryObject(COSName.JAVA_SCRIPT);
        if (jsNameTree != null) {
            for (COSName key : jsNameTree.keySet()) {
                COSDictionary jsDict = (COSDictionary) jsNameTree.getDictionaryObject(key);
                if (jsDict != null && jsDict.containsKey(COSName.JS)) {
                    String jsContent = jsDict.getString(COSName.JS);
                    if (containsMaliciousJavaScript(jsContent)) {
                        System.err.println("Potentially malicious JavaScript found in document-level JavaScript: "
                                + jsContent);
                        throw new IOException("The provided PDF contains potentially malicious content.");
                    }
                }
            }
        }
    }
}
private void checkFormFieldsForJavaScript(PDDocument document) throws IOException {
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
    if (acroForm != null) {
        for (PDField field : acroForm.getFields()) {
            if (field.getActions() != null) {
                COSDictionary actionsDict = field.getActions().getCOSObject();
                for (COSName key : actionsDict.keySet()) {
                    if (COSName.JS.equals(key)) {
                        COSDictionary jsDict = (COSDictionary) actionsDict.getDictionaryObject(key);
                        if (jsDict != null) {
                            String jsContent = jsDict.getString(COSName.JS);
                            if (containsMaliciousJavaScript(jsContent)) {
                                System.err.println(
                                        "Potentially malicious JavaScript found in form field: " + jsContent);
                                throw new IOException("The provided PDF contains potentially malicious content.");
                            }
                        }
                    }
                }
            }
        }
    }
}
// Case-insensitive keyword list; matches whole words only
private static final Pattern MALICIOUS_JS_PATTERN = Pattern.compile(
        "(?i)\\b(alert|confirm|prompt|eval|open|window\\.open|document\\.write|document\\.writeln|location\\.href|location\\.assign|location\\.replace|iframe|script|execScript|expression|innerHTML|outerHTML|style|setAttribute|addEventListener|on[a-z]+)\\b");

private boolean containsMaliciousJavaScript(String jsContent) {
    // Callers may pass null when a dictionary has no string value under /JS
    if (jsContent == null) {
        return false;
    }
    Matcher matcher = MALICIOUS_JS_PATTERN.matcher(jsContent);
    return matcher.find();
}
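To see how the keyword pattern above behaves on its own, here is a small standalone probe. It is only a sketch: the class name JsPatternProbe and the three sample strings are made up for illustration, and the pattern is copied from the snippet with the duplicate "eval" alternative left out.

```java
import java.util.regex.Pattern;

/** Standalone probe of the keyword pattern; class name and samples are illustrative. */
class JsPatternProbe {

    // Same alternatives as MALICIOUS_JS_PATTERN in the snippet above.
    static final Pattern MALICIOUS_JS_PATTERN = Pattern.compile(
            "(?i)\\b(alert|confirm|prompt|eval|open|window\\.open|document\\.write"
            + "|document\\.writeln|location\\.href|location\\.assign|location\\.replace"
            + "|iframe|script|execScript|expression|innerHTML|outerHTML|style"
            + "|setAttribute|addEventListener|on[a-z]+)\\b");

    static boolean containsMaliciousJavaScript(String jsContent) {
        return jsContent != null && MALICIOUS_JS_PATTERN.matcher(jsContent).find();
    }

    public static void main(String[] args) {
        // Ordinary annotation text: "only" and "once" both match on[a-z]+
        System.out.println(containsMaliciousJavaScript("Please respond only once")); // true
        // A script that posts form data: no listed keyword occurs as a whole word
        System.out.println(containsMaliciousJavaScript(
                "this.submitForm('http://example.com/collect');")); // false
        // A keyword that is on the list
        System.out.println(containsMaliciousJavaScript("app.alert('hi');")); // true
    }
}
```

So the pattern flags plain English words that start with "on" while letting through scripts that use none of the listed keywords, which may be part of why the results differ from what I expected.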
I'm working on a method to validate whether a PDF contains embedded JavaScript. I expected the validation method to correctly detect the JavaScript in the PDF, but it’s not working as intended.
Could anyone provide guidance on:
1. Proper dependencies required for validating JavaScript in PDFs?
2. Example code for detecting embedded JavaScript in PDFs?
Any help or suggestions would be greatly appreciated. Thanks!