Read embedded PDF file in Excel using Java
I am new to Java programming. My current project requires me to read embedded (OLE) files in an Excel sheet and get the text contents from them. The examples for reading an embedded Word file worked fine, but I am unable to find one for reading an embedded PDF file. I tried a few things by looking at similar examples, but they didn't work out.
http://poi.apache.org/spreadsheet/quick-guide.html#embedded
I have the code below; hopefully someone can point me in the right direction. I have used Apache POI to read the embedded files in Excel and PDFBox to parse the PDF data.
public class ReadExcel1 {
    public static void main(String[] args) {
        try {
            FileInputStream file = new FileInputStream(new File("C:\\test.xls"));
            POIFSFileSystem fs = new POIFSFileSystem(file);
            HSSFWorkbook workbook = new HSSFWorkbook(fs);
            for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
                String oleName = obj.getOLE2ClassName();
                if (oleName.equals("Acrobat Document")) {
                    System.out.println("acrobat reader document");
                    try {
                        DirectoryNode dn = (DirectoryNode) obj.getDirectory();
                        for (Iterator<Entry> entries = dn.getEntries(); entries.hasNext();) {
                            DocumentEntry nativeEntry = (DocumentEntry) dn.getEntry("CONTENTS");
                            byte[] data = new byte[nativeEntry.getSize()];
                            ByteArrayInputStream bao = new ByteArrayInputStream(data);
                            PDFParser pdfParser = new PDFParser(bao);
                            pdfParser.parse();
                            COSDocument cosDoc = pdfParser.getDocument();
                            PDFTextStripper pdfStripper = new PDFTextStripper();
                            PDDocument pdDoc = new PDDocument(cosDoc);
                            pdfStripper.setStartPage(1);
                            pdfStripper.setEndPage(2);
                            System.out.println("text pdf " + pdfStripper.getText(pdDoc));
                        }
                    } catch (Exception e) {
                        System.out.println("error reading " + e.getMessage());
                    } finally {
                        System.out.println("finally ");
                    }
                } else {
                    System.out.println("nothing ");
                }
            }
            file.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Below is the output in Eclipse:
acrobat reader document
error reading error: end-of-file, expected line nothing
The PDFs weren't OLE 1.0 packaged but somehow differently embedded; at least the extraction worked for me. This is not a general solution, because it depends on how the embedding application names the entries. Of course, for PDFs you could check every DocumentNode for the magic number "%PDF", and in the case of OLE 1.0 packaged elements this needs to be done differently.
I think the real filename of the PDF is hidden somewhere in the \1Ole or CompObj entries, but for the example, and apparently for your use case, that's not necessary to determine.
import java.io.*;
import java.net.URL;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.IOUtils;

public class EmbeddedPdfInExcel {
    public static void main(String[] args) throws Exception {
        NPOIFSFileSystem fs = new NPOIFSFileSystem(new URL("http://jamesshaji.com/sample.xls").openStream());
        HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
        for (HSSFObjectData obj : wb.getAllEmbeddedObjects()) {
            String oleName = obj.getOLE2ClassName();
            DirectoryNode dn = (DirectoryNode) obj.getDirectory();
            // Acrobat objects keep the raw PDF bytes in an entry named CONTENTS
            if (oleName.contains("Acro") && dn.hasEntry("CONTENTS")) {
                InputStream is = dn.createDocumentInputStream("CONTENTS");
                // Write the embedded stream to <directory name>.pdf
                FileOutputStream fos = new FileOutputStream(obj.getDirectory().getName() + ".pdf");
                IOUtils.copy(is, fos);
                fos.close();
                is.close();
            }
        }
        fs.close();
    }
}
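The magic-number check mentioned above could look roughly like the sketch below. It uses only the JDK; the class PdfMagicCheck and its methods are hypothetical helpers made up for illustration, not part of Apache POI or PDFBox. It also assumes the "%PDF" marker sits at the very first byte of the entry, which may be too strict for files that carry leading bytes before the header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Apache POI or PDFBox.
class PdfMagicCheck {

    // ASCII bytes of the marker a PDF is assumed to start with.
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with "%PDF". */
    static boolean startsWithPdfMagic(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first four bytes of the stream and checks them against "%PDF". */
    static boolean looksLikePdf(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream ended before four bytes were available
            }
            read += n;
        }
        return read == head.length && startsWithPdfMagic(head);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] other = "hello world".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(new ByteArrayInputStream(pdf)));
        System.out.println(looksLikePdf(new ByteArrayInputStream(other)));
    }
}
```

In the extraction loop, the stream returned by dn.createDocumentInputStream(...) for each candidate entry could be passed to looksLikePdf instead of relying on the OLE class name. Because the check consumes the first bytes, the entry would need to be reopened before copying it to a file.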