Read embedded pdf file in excel using Java

Question

I am new to Java programming. My current project requires me to read embedded (OLE) files in an Excel sheet and extract their text contents. Examples for reading an embedded Word file worked fine, but I am unable to find help on reading an embedded PDF file. I tried a few things based on similar examples, but they didn't work out.

http://poi.apache.org/spreadsheet/quick-guide.html#Embedded

My code is below; with some help I can probably get going in the right direction. I have used Apache POI to read the embedded files in Excel and PDFBox to parse the PDF data.

public class ReadExcel1 {

public static void main(String[] args) {

    try {

        FileInputStream file = new FileInputStream(new File("C:\\test.xls"));

        POIFSFileSystem fs = new POIFSFileSystem(file);
        HSSFWorkbook workbook = new HSSFWorkbook(fs);

        for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {

            String oleName = obj.getOLE2ClassName();

           if(oleName.equals("Acrobat Document")){
                System.out.println("Acrobat reader document");

                try{
                    DirectoryNode dn = (DirectoryNode) obj.getDirectory();
                    for (Iterator<Entry> entries = dn.getEntries(); entries.hasNext();) {

                        DocumentEntry nativeEntry = (DocumentEntry) dn.getEntry("CONTENTS");
                        byte[] data = new byte[nativeEntry.getSize()];

                        ByteArrayInputStream bao= new ByteArrayInputStream(data);
                        PDFParser pdfparser = new PDFParser(bao);

                        pdfparser.parse();
                        COSDocument cosDoc = pdfparser.getDocument();
                        PDFTextStripper pdfStripper = new PDFTextStripper();
                        PDDocument pdDoc = new PDDocument(cosDoc);
                        pdfStripper.setStartPage(1);
                        pdfStripper.setEndPage(2);
                        System.out.println("Text from the pdf "+pdfStripper.getText(pdDoc));
                    }
                }catch(Exception e){
                    System.out.println("Error reading "+ e.getMessage());
                }finally{
                    System.out.println("Finally ");
                }
            }else{
                System.out.println("nothing ");
            }
        }

        file.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

}

Below is the output in Eclipse:

Acrobat reader document
Error reading Error: End-of-File, expected line
Finally
nothing

The first thing that looks strange is dn.getEntry("CONTENTS"): the PDF should be in some DirectoryNode called MBD... (see my other answer for more details). I guess you are accessing an empty stream. Can you provide a sample Excel file?
– kiwiwings, Commented Aug 26, 2013 at 17:34
Did you try reading the Apache POI embedded documents documentation?
– Gagravarr, Commented Aug 26, 2013 at 22:18
@kiwiwings I do see "MBD" entries in the DirectoryNode, but they don't have any data in them. dn.getEntry("CONTENTS") gives me data with a size of more than 10000, so my assumption was that the data is available in that entry.
– James Shaji, Commented Aug 27, 2013 at 10:37
@James Shaji If you upload a sample file, I can get my hands on it. I'd have to try whether you get the data from the HSSFObjectData without further processing, or whether one has to use the POIFS entry to retrieve it. Furthermore, there can be a difference between embedded and (OLE 1.0) packaged objects, so it's simply easier to find out with a real file (and not just theoretical hinting ...).
– kiwiwings, Commented Aug 27, 2013 at 11:19
@kiwiwings I have uploaded the Excel sheet to jamesshaji.com/sample.xls
– James Shaji, Commented Aug 27, 2013 at 14:23

kiwiwings · Accepted Answer · 2013-08-27 23:18:28Z


The PDFs weren't OLE 1.0 packaged, but embedded in some other way; at least the extraction worked for me. This is not a general solution, because it depends on how the embedding application names the entries. For PDFs you could of course check all DocumentNodes for the magic number "%PDF"; OLE 1.0 packaged elements need to be handled differently.
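The magic-number check mentioned above could look roughly like the sketch below. It uses only the standard library and works on any InputStream; PdfMagic and looksLikePdf are hypothetical names chosen for illustration and are not part of POI or PDFBox. It only tests whether the stream starts with the four bytes "%PDF" at offset 0, so treat it as a heuristic rather than a validation of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of POI or PDFBox.
final class PdfMagic {
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream starts with "%PDF". Consumes up to four bytes. */
    static boolean looksLikePdf(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int filled = 0;
        // read() may deliver fewer bytes than requested, so loop until full or EOF
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                return false; // stream ended before four bytes were available
            }
            filled += n;
        }
        return Arrays.equals(head, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] other = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};
        System.out.println(looksLikePdf(new ByteArrayInputStream(pdf)));
        System.out.println(looksLikePdf(new ByteArrayInputStream(other)));
    }
}
```

With POI, the stream for each candidate entry could come from dn.createDocumentInputStream(name), the same call the extraction code in this answer uses for "CONTENTS". Because the check consumes bytes, a fresh stream would have to be opened for the actual copy.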

I think the real filename of the PDF is hidden somewhere in the \1Ole or CompObj entries, but for the example, and apparently for your use case, it isn't necessary to determine it.

import java.io.*;
import java.net.URL;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.IOUtils;

public class EmbeddedPdfInExcel {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes the streams even if extraction fails
        try (InputStream in = new URL("http://jamesshaji.com/sample.xls").openStream();
             NPOIFSFileSystem fs = new NPOIFSFileSystem(in)) {
            HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
            for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
                String oleName = obj.getOLE2ClassName();
                DirectoryNode dn = (DirectoryNode) obj.getDirectory();
                // In this sample file the raw PDF bytes sit in an entry named "CONTENTS"
                if (oleName.contains("Acro") && dn.hasEntry("CONTENTS")) {
                    // Write each embedded PDF to <directory name>.pdf in the working directory
                    try (InputStream is = dn.createDocumentInputStream("CONTENTS");
                         OutputStream fos = new FileOutputStream(dn.getName() + ".pdf")) {
                        IOUtils.copy(is, fos);
                    }
                }
            }
        }
    }
}
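If the goal is to extract the text rather than save a file, the embedded bytes have to be read into memory before they are handed to the parser. The code in the question allocates a byte array of the entry's size but never reads anything into it, which is likely why the parser reports an End-of-File error. A minimal sketch of draining a stream into a byte array, using only the standard library; StreamBytes and readAll are hypothetical names, not POI or PDFBox API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of POI or PDFBox.
final class StreamBytes {
    /** Reads the stream until EOF and returns everything that was read. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        // read() returns -1 at end of stream; otherwise n bytes of buf are valid
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        byte[] copy = readAll(new ByteArrayInputStream(source));
        System.out.println(copy.length);
    }
}
```

In the POI case, the input would be the stream from dn.createDocumentInputStream("CONTENTS"), and the resulting array could be wrapped in a ByteArrayInputStream and passed to the PDF parser the way the question's code intends.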


2 Comments

James Shaji Over a year ago

Thanks kiwiwings!! Where can I find documentation to help me understand the file structure?

kiwiwings Over a year ago

Do you really want to read through the MS specs??? There are two specs to go through: the OLE structures and the binary xls format. For the other Office formats, you'll find the specs close to the second link.
