I am new to Java programming. My current project requires me to read embedded (OLE) files in an Excel sheet and extract their text contents. The examples for reading an embedded Word file worked fine, but I can't find any help on reading an embedded PDF file. I tried a few things based on similar examples, which didn't work out.

http://poi.apache.org/spreadsheet/quick-guide.html#Embedded

My code is below; with some help I can probably get going in the right direction. I use Apache POI to read the embedded files in Excel and PDFBox to parse the PDF data.

public class ReadExcel1 {

public static void main(String[] args) {

    try {

        FileInputStream file = new FileInputStream(new File("C:\\test.xls"));

        POIFSFileSystem fs = new POIFSFileSystem(file);
        HSSFWorkbook workbook = new HSSFWorkbook(fs);

        for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {

            String oleName = obj.getOLE2ClassName();

            if(oleName.equals("Acrobat Document")){
                System.out.println("Acrobat reader document");

                try{
                    DirectoryNode dn = (DirectoryNode) obj.getDirectory();
                    for (Iterator<Entry> entries = dn.getEntries(); entries.hasNext();) {

                        DocumentEntry nativeEntry = (DocumentEntry) dn.getEntry("CONTENTS");
                        byte[] data = new byte[nativeEntry.getSize()];

                        ByteArrayInputStream bao= new ByteArrayInputStream(data);
                        PDFParser pdfparser = new PDFParser(bao);

                        pdfparser.parse();
                        COSDocument cosDoc = pdfparser.getDocument();
                        PDFTextStripper pdfStripper = new PDFTextStripper();
                        PDDocument pdDoc = new PDDocument(cosDoc);
                        pdfStripper.setStartPage(1);
                        pdfStripper.setEndPage(2);
                        System.out.println("Text from the pdf "+pdfStripper.getText(pdDoc));
                    }
                }catch(Exception e){
                    System.out.println("Error reading "+ e.getMessage());
                }finally{
                    System.out.println("Finally ");
                }
            }else{
                System.out.println("nothing ");
            }
        }

        file.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

}

Below is the output in Eclipse:

Acrobat reader document
Error reading Error: End-of-File, expected line
Finally
nothing

  • The first thing that looks strange is dn.getEntry("CONTENTS"): the PDF should be in some DirectoryNode called MBD... (see my other answer for more details). I guess you are accessing some empty stream. Can you provide a sample Excel file? Commented Aug 26, 2013 at 17:34
  • Did you try reading the Apache POI embedded documents documentation? Commented Aug 26, 2013 at 22:18
  • @kiwiwings I do see "MBD" entries in the DirectoryNode, but they don't have any data in them. dn.getEntry("CONTENTS") gives me an entry with a size of more than 10000, so my assumption was that the data is available in that particular entry. Commented Aug 27, 2013 at 10:37
  • @James Shaji If you upload a sample file, I can get my hands on it. I'll have to check whether you get the data without further processing from the HSSFObjectData, or whether one has to use the POIFS entry to retrieve it. Furthermore, there can be a difference between embedded and (OLE 1.0) packaged objects, so it's simply easier to find out with a real file (and not just theoretical hinting ...) Commented Aug 27, 2013 at 11:19
  • @kiwiwings I have uploaded the excel sheet to jamesshaji.com/sample.xls Commented Aug 27, 2013 at 14:23

1 Answer

The PDFs weren't OLE 1.0 packaged, but embedded in some other way; at least the extraction worked for me. This is not a general solution, because it depends on how the embedding application names the entries. For PDFs you could of course check all DocumentNodes for the magic number "%PDF"; for OLE 1.0 packaged elements this needs to be done differently.
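
The magic-number check mentioned above could look roughly like the sketch below. It uses only the JDK; the class and method names (PdfMagicCheck, startsWithPdfMagic) are made up for illustration and are not part of POI or PDFBox. With POI you would presumably feed it the first few bytes read from each document entry's input stream.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a POI or PDFBox class.
class PdfMagicCheck {
    // A PDF file starts with the ASCII characters "%PDF".
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with "%PDF". */
    static boolean startsWithPdfMagic(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfLike = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] zeros = new byte[16]; // e.g. a buffer that was allocated but never filled
        System.out.println(startsWithPdfMagic(pdfLike)); // true
        System.out.println(startsWithPdfMagic(zeros));   // false
    }
}
```

Checking the content rather than the entry name avoids depending on how the embedding application names its entries, at the cost of opening every entry.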

I think the real filename of the PDF is hidden somewhere in the \1Ole or CompObj entries, but for the example, and apparently for your use case, it isn't necessary to determine it.

import java.io.*;
import java.net.URL;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.IOUtils;

public class EmbeddedPdfInExcel {
    public static void main(String[] args) throws Exception {
        // Open the workbook's OLE2 container; try-with-resources closes it even on failure.
        try (NPOIFSFileSystem fs = new NPOIFSFileSystem(new URL("http://jamesshaji.com/sample.xls").openStream())) {
            HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
            for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
                String oleName = obj.getOLE2ClassName();
                DirectoryNode dn = (DirectoryNode) obj.getDirectory();
                // In this sample the PDF bytes sit in an entry named "CONTENTS".
                if (oleName.contains("Acro") && dn.hasEntry("CONTENTS")) {
                    // Copy the entry to <directory name>.pdf in the working directory.
                    try (InputStream is = dn.createDocumentInputStream("CONTENTS");
                         FileOutputStream fos = new FileOutputStream(dn.getName() + ".pdf")) {
                        IOUtils.copy(is, fos);
                    }
                }
            }
        }
    }
}
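
If you want the text in memory rather than a file on disk, note that the code in the question only allocates a byte array of the entry's size and never reads anything into it, so the parser is handed nothing but zeros. A minimal JDK-only sketch of the difference follows; ReadEntryBytes and readFully are hypothetical names, and the ByteArrayInputStream is just a stand-in for the stream that dn.createDocumentInputStream("CONTENTS") returns.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a POI or PDFBox class.
class ReadEntryBytes {
    /** Reads every remaining byte of the stream into a new array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the embedded entry's content (assumption for this demo).
        byte[] fakeEntry = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);

        // What the question's code does: allocate only, so every byte is still 0.
        byte[] allocatedOnly = new byte[fakeEntry.length];
        System.out.println(allocatedOnly[0]); // 0

        // What is needed: actually read the stream into the array.
        try (InputStream in = new ByteArrayInputStream(fakeEntry)) {
            byte[] data = readFully(in);
            System.out.println(new String(data, 0, 4, StandardCharsets.US_ASCII)); // %PDF
        }
    }
}
```

The filled array can then be wrapped in a ByteArrayInputStream and handed to the PDF parser as in the question.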

2 Comments

Thanks kiwiwings!! Where can I find documentation to help me understand file structure?
Do you really want to read through the MS specs? There are two specs to go through: the OLE structures and the binary xls format. For the other Office formats, you'll find the specs close to the second link.
