java - Re-write the same text into an existing PDF document by using PDFBox
This is an important question for me, and I would appreciate your interest.
I used PDFBox to create a simple PDF document. What I am trying to do is read the existing document and re-write the same text into it, in the same position.
1) First, I create a PDF named "musique.pdf".
2) I read the existing document.
3) I extract the text of the document with PDFTextStripper.
3) I find the position of each character in the document (x, y, width, fs, etc.).
4) I create a table that must contain the x and y of each character, for example table1[0]=x1, table1[1]=y1, table1[2]=x2, table1[3]=y2, etc.
5) I create a loop of PDPageContentStream objects to re-write each character at the correct position.
The problem is:
The first line is written correctly, but there is a problem with the second line.
"I noticed that if I have, for example, a text formed of 3 lines, and I assume it contains 225 characters, then when I ask for the length of the text, it gives a length equal to 231. So it adds 2 spaces at the end of each line, but when I search for the position of each character, the program does not consider these added spaces."
Please run the code below and tell me how to resolve this problem.
My code so far:
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package test;

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class Test extends PDFTextStripper {
    private static final String src = "...";
    private static int i;
    private static float[] table1;
    private static PDPageContentStream content;
    private static float jjj;

    public Test() throws IOException {
        super.setSortByPosition(true);
    }

    public static void createPdf(String src) throws IOException, COSVisitorException {
        // Create the document named "musique.pdf"
        PDRectangle rec = new PDRectangle(400, 400);
        PDDocument document = null;
        document = new PDDocument();
        PDPage page = new PDPage(rec);
        document.addPage(page);
        PDFont font = PDType1Font.HELVETICA;

        PDPageContentStream canvas1 = new PDPageContentStream(document, page, true, true);
        canvas1.setFont(font, 10);
        canvas1.beginText();
        canvas1.appendRawCommands("15 385 Td");
        canvas1.appendRawCommands("(La musique est très importante dans notre vie moderne. Sans la musique, non)Tj\n");
        canvas1.endText();
        canvas1.close();

        PDPageContentStream canvas2 = new PDPageContentStream(document, page, true, true);
        canvas2.setFont(font, 11);
        canvas2.beginText();
        canvas2.appendRawCommands("15 370 Td");
        canvas2.appendRawCommands("(Donc il est très necessaire de jouer chaque jours la musique.)Tj\n");
        canvas2.endText();
        canvas2.close();

        document.save("musique.pdf");
        document.close();
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, COSVisitorException {
        Test tes = new Test();
        tes.createPdf(src);

        // Read the existing document
        PDDocument doc;
        doc = PDDocument.load("musique.pdf");
        List pages = doc.getDocumentCatalog().getAllPages();
        PDPage page = (PDPage) pages.get(0);

        // Extract the text that exists in the document
        PDFTextStripper stripper = new PDFTextStripper();
        String texte = stripper.getText(doc);
        PDStream contents = page.getContents();
        if (contents != null) {
            i = 1;
            table1 = new float[texte.length() * 2];
            table1[0] = (float) 15.0;

            // The call below invokes processTextPosition in order to find the position
            // of each character and put each value in a cell of table1
            tes.processStream(page, page.findResources(), page.getContents().getStream());

            // After processTextPosition has run, execution continues with the code below
            int iii = 0;
            int kkk = 0;

            // Create a loop of PDPageContentStream in order to re-write the complete text in the document.
            // When you run this code, you will notice the problem with the second line. How can I resolve it?
            PDFont font = PDType1Font.HELVETICA;
            while (kkk < table1.length) {
                content = new PDPageContentStream(doc, page, true, true);
                content.setFont(font, 10);
                content.beginText();
                jjj = 400 - table1[kkk + 1];
                content.appendRawCommands("" + table1[kkk] + " " + jjj + " Td");
                content.appendRawCommands("(" + texte.charAt(iii) + ")" + " Tj\n");
                content.endText();
                content.close();
                iii = iii + 1;
                kkk = kkk + 2;
            }
        }

        // Save the modified document
        doc.save("modified-musique.pdf");
        doc.close();
    }

    /**
     * @param text The text to be processed
     */
    public void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                + " fs=" + text.getFontSize() + " xscale=" + text.getXScale()
                + " height=" + text.getHeightDir() + " space=" + text.getWidthOfSpace()
                + " width=" + text.getWidthDirAdj() + "]" + text.getCharacter());
        if (i > 1) {
            table1[i] = text.getXDirAdj();
            System.out.println(table1[i]);
            i = i + 1;
            table1[i] = text.getYDirAdj();
            System.out.println(table1[i]);
            i = i + 1;
        } else {
            table1[i] = text.getYDirAdj();
            System.out.println(table1[i]);
            i = i + 1;
        }
    }
}
Best regards,
Liszt.
There are shortcomings in both your concept and your code.
First of all, the concept: you have two items numbered 3:
3) I extract the text of the document with PDFTextStripper.
3) I find the position of each character in the document (x, y, width, fs, etc.).
Separating these two steps is a bad idea in my eyes, because in general you will have trouble matching each character from the text extraction with its corresponding glyph in the content.
It is difficult in general because, for example, which e glyph in the content corresponds to which e character in the text? Counting on the order of appearance in the content stream being identical to the order in the parsed text only works for very simple page contents.
And there are additional problems imposed by replacements: for example, text extraction quite often expands ligatures, so a single ﬀ ligature glyph in the content comes back as the two characters ff in the text.
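The length mismatch caused by ligature expansion can be sketched with the JDK alone. This is only an illustration: it uses java.text.Normalizer with NFKC as a stand-in for whatever normalization the text extraction applies (that PDFBox behaves exactly like NFKC is an assumption), and the class and method names are hypothetical.

```java
import java.text.Normalizer;

class LigatureExpansion {
    /** Applies Unicode compatibility normalization (NFKC), which decomposes ligature code points. */
    static String expand(String glyphText) {
        return Normalizer.normalize(glyphText, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // One entry per drawn glyph: e, the ff ligature (U+FB00), e, t
        String drawn = "e\uFB00et";
        String extracted = expand(drawn);
        // The extracted text has one more character than there are glyph positions
        System.out.println(drawn.length() + " glyphs, " + extracted.length() + " characters: " + extracted);
    }
}
```

With such a mismatch, indexing the extracted text by glyph number drifts by one after the first ligature.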
Additionally, there is the matter of going back and forth between the font encoding and the string encoding, which might be quite lossy.
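As a rough analogy only, the following sketch shows how a round trip through an encoding that cannot represent a character loses information. Java charsets are used here as a stand-in for PDF font encodings, which work differently in detail; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    /** Encodes the text with the given charset and decodes it again. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String word = "tr\u00E8s"; // "très", taken from the sample document
        // ISO-8859-1 can represent the accented letter, so nothing is lost
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1).equals(word));
        // US-ASCII cannot, so the letter is replaced and the original is not recoverable
        System.out.println(roundTrip(word, StandardCharsets.US_ASCII));
    }
}
```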
Furthermore, text extraction adds whitespace characters to the text that are not present in the content. For example, it can add a line break where it recognizes a jump in the y direction, or a space where it recognizes a jump in the x direction.
BTW, this is the reason for your observation:
"I noticed that if I have, for example, a text formed of 3 lines, and I assume it contains 225 characters, then when I ask for the length of the text, it gives a length equal to 231. So it adds 2 spaces at the end of each line, but when I search for the position of each character, the program does not consider these added spaces."
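The arithmetic behind that observation can be reproduced without PDFBox. The sketch below assumes that the two extra characters per line are a "\r\n" line separator (the question only says "2 spaces", so this is a guess); the class and method names are hypothetical.

```java
class ExtractedLengthDemo {
    /** Counts the characters of the extracted text that are not line-break characters. */
    static int countWithoutLineBreaks(String extracted) {
        int count = 0;
        for (int k = 0; k < extracted.length(); k++) {
            char c = extracted.charAt(k);
            if (c != '\r' && c != '\n') {
                count++;
            }
        }
        return count;
    }

    /** Builds a fake extraction result: 'lines' lines of 'perLine' glyphs, each followed by "\r\n". */
    static String fakeExtraction(int lines, int perLine) {
        StringBuilder sb = new StringBuilder();
        for (int l = 0; l < lines; l++) {
            for (int c = 0; c < perLine; c++) {
                sb.append('a');
            }
            sb.append("\r\n"); // added by the extraction, no glyph position exists for it
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String extracted = fakeExtraction(3, 75);
        System.out.println("extracted length: " + extracted.length());
        System.out.println("glyph positions:  " + countWithoutLineBreaks(extracted));
    }
}
```

Three lines of 75 glyphs give 225 positions but an extracted length of 231, so a position table indexed by the extracted text goes out of step at the first line break, which is where the second line starts.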
Furthermore, your code makes the PDF size explode:
5) I create a loop of PDPageContentStream objects to re-write each character at the correct position.
while (kkk < table1.length) {
    content = new PDPageContentStream(doc, page, true, true);
    ...
}
I propose at least creating only a single additional content stream...
What about starting with this:
// Read the existing document
PDDocument doc;
doc = PDDocument.load(musiqueFileName);
List<?> pages = doc.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) pages.get(0);

PDPageContentStream content = new PDPageContentStream(doc, page, true, true);
TestRewriter rewriter = new TestRewriter(content);
rewriter.processStream(page, page.findResources(), page.getContents().getStream());
content.close();

// Save the modified document
doc.save(modifiedMusiqueFileName);
doc.close();
Here TestRewriter is a subclass of PDFTextStripper, too:
public static class TestRewriter extends PDFTextStripper {
    final PDPageContentStream canvas;

    public TestRewriter(PDPageContentStream canvas) throws IOException {
        this.canvas = canvas;
    }

    /**
     * @param text
     *            The text to be processed
     */
    public void processTextPosition(TextPosition text) {
        try {
            PDFont font = PDType1Font.HELVETICA;
            canvas.setFont(font, 10);
            canvas.beginText();
            canvas.appendRawCommands("" + (text.getXDirAdj()) + " " + (400 - text.getYDirAdj()) + " Td");
            canvas.appendRawCommands("(" + text.getCharacter() + ")" + " Tj\n");
            canvas.endText();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This is still far from perfect, but you may want to continue from here...
If you need to parse the actual text in parallel, integrate more of the PDFTextStripper method processTextPosition to combine both functionalities.
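One gap worth closing when continuing from here: the rewriter writes each character between raw parentheses. In PDF syntax a literal string is delimited by parentheses and uses the backslash as its escape character, so a text containing (, ) or \ would produce a malformed content stream. A minimal sketch of an escaping helper follows; the class and method names are hypothetical, and it does not deal with characters outside the font's encoding.

```java
class PdfLiteralEscaper {
    /** Prefixes parentheses and backslashes with a backslash, as PDF literal strings require. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '(' || c == ')' || c == '\\') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Builds a show-text operator for the given text, e.g. for use with appendRawCommands. */
    static String showTextOperator(String s) {
        return "(" + escape(s) + ") Tj\n";
    }

    public static void main(String[] args) {
        // The parentheses inside the text are escaped; the outer delimiters are not
        System.out.print(showTextOperator("f(x) = 1"));
    }
}
```

In the rewriter, the argument of the second appendRawCommands call could then be replaced by showTextOperator(text.getCharacter()).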