How to replace text(tag) with HTML in docx using Apache POI?

Question

We have a template docx file that contains tags such as ${content}. I need to replace these tags with HTML.

For this purpose I want to use the altChunk element in XWPFDocument. Following the answer to "How to add an altChunk element to a XWPFDocument using Apache POI", I was able to place an altChunk at the end of the docx.

How can I replace my tag with it? Or could I use another library, maybe docx4j?

UPD: The template docx files with tags are created by end users in MS Word and look like this:

"How can I replace my tag with it?" That depends on where the tag is. According to the Office Open XML specification, altChunk can only occur in IBody elements. So if your ${content} is in a text run, it cannot be replaced with an altChunk. Only if ${content} is an IBodyElement of its own would it be possible to find that IBodyElement, create an XmlCursor, insert the altChunk, and then remove the IBodyElement.
– Axel Richter, Commented Dec 20, 2018 at 8:03
@AxelRichter, if this tag is written in MS Word, is it considered an IBodyElement? If not, do you know how to make it an IBodyElement using MS Word? Please see my update. Maybe I should not replace it; maybe I could place the altChunk just after the tag and then remove the text with the tag. Any ideas?
– Max, Commented Dec 20, 2018 at 8:33
It looks as if it is its own paragraph, so it is an IBodyElement. I will try to provide a solution this evening (Germany).
– Axel Richter, Commented Dec 20, 2018 at 9:13

Axel Richter · Accepted Answer · 2018-12-20 16:10:56Z

If "${content}" is in an IBodyElement of its own, then the requirement can be solved by finding that IBodyElement, creating an XmlCursor, inserting the altChunk, and then removing the IBodyElement.

The following code demonstrates this by extending the example in How to add an altChunk element to a XWPFDocument using Apache POI. It provides a method that replaces a found IBodyElement containing a special text with an altChunk that references a MyXWPFHtmlDocument. It uses an XmlCursor to get the needed position in the text body. The usage of the XmlCursor is explained in the code comments.
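
Note that the method in the code matches paragraphs with text.contains(textToFind), so a paragraph that holds other text around the tag would be removed as a whole. If that is not wanted, a stricter test could be applied to the result of paragraph.getText() first. The following is only a sketch under that assumption; TagParagraphCheck and isStandaloneTag are hypothetical names and not part of Apache POI:

```java
public class TagParagraphCheck {

    // Hypothetical helper (not an Apache POI method): returns true only when the
    // paragraph text consists of the tag and nothing else, ignoring leading and
    // trailing whitespace.
    static boolean isStandaloneTag(String paragraphText, String tag) {
        return paragraphText != null && paragraphText.trim().equals(tag);
    }

    public static void main(String[] args) {
        String tag = "${content}";
        System.out.println(isStandaloneTag("${content}", tag));       // tag alone
        System.out.println(isStandaloneTag("  ${content} ", tag));    // tag with padding
        System.out.println(isStandaloneTag("Dear ${content},", tag)); // tag inside other text
    }
}
```

In the replacement method, the condition text.contains(textToFind) could then be swapped for a call to such a helper, so that only paragraphs consisting of the tag alone are replaced.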

template.docx:

Code:

import java.io.*;

import org.apache.poi.*;
import org.apache.poi.ooxml.*;
import org.apache.poi.openxml4j.opc.*;

import org.apache.poi.xwpf.usermodel.*;

import org.apache.xmlbeans.XmlCursor;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTAltChunk;

public class WordInsertHTMLaltChunkInDocument {

 //a method for creating the htmlDoc /word/htmlDoc#.html in the *.docx ZIP archive  
 //String id will be htmlDoc#.
 private static MyXWPFHtmlDocument createHtmlDoc(XWPFDocument document, String id) throws Exception {
  OPCPackage oPCPackage = document.getPackage();
  PackagePartName partName = PackagingURIHelper.createPartName("/word/" + id + ".html");
  PackagePart part = oPCPackage.createPart(partName, "text/html");
  MyXWPFHtmlDocument myXWPFHtmlDocument = new MyXWPFHtmlDocument(part, id);
  document.addRelation(myXWPFHtmlDocument.getId(), new XWPFHtmlRelation(), myXWPFHtmlDocument);
  return myXWPFHtmlDocument;
 }

 //a method for replacing an IBodyElement containing a special text with a CTAltChunk which
 //references MyXWPFHtmlDocument
 private static void replaceIBodyElementWithAltChunk(XWPFDocument document, String textToFind, 
                                                     MyXWPFHtmlDocument myXWPFHtmlDocument) throws Exception {
  int pos = 0;
  for (IBodyElement bodyElement : document.getBodyElements()) {
   if (bodyElement instanceof XWPFParagraph) {
    XWPFParagraph paragraph = (XWPFParagraph)bodyElement;
    String text = paragraph.getText();
    if (text != null && text.contains(textToFind)) {
     //create XmlCursor at this paragraph
     XmlCursor cursor = paragraph.getCTP().newCursor();
     cursor.toEndToken(); //now we are at end of the paragraph
     //there always must be a next start token. Either a p or at least sectPr.
     while (cursor.toNextToken() != XmlCursor.TokenType.START);
     //now we can insert the CTAltChunk here
     String uri = CTAltChunk.type.getName().getNamespaceURI();
     cursor.beginElement("altChunk", uri);
     cursor.toParent();
     CTAltChunk cTAltChunk = (CTAltChunk)cursor.getObject();
     //set the altChunk's Id to reference the given MyXWPFHtmlDocument
     cTAltChunk.setId(myXWPFHtmlDocument.getId());

     //now remove the found IBodyElement
     document.removeBodyElement(pos);

     break; //break for each loop
    }
   }
   pos++;
  }
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("template.docx"));

  MyXWPFHtmlDocument myXWPFHtmlDocument = createHtmlDoc(document, "htmlDoc1");
  myXWPFHtmlDocument.setHtml(myXWPFHtmlDocument.getHtml().replace("<body></body>",
   "<body><p>Simple <b>HTML</b> <i>formatted</i> <u>text</u></p></body>"));

  replaceIBodyElementWithAltChunk(document, "${content}", myXWPFHtmlDocument);

  //try-with-resources closes the stream even if writing fails
  try (FileOutputStream out = new FileOutputStream("result.docx")) {
   document.write(out);
  }
  document.close();

 }

 //a wrapper class for the htmlDoc /word/htmlDoc#.html in the *.docx ZIP archive
 //provides methods for manipulating the HTML
 //TODO: We should *not* use String methods for manipulating HTML!
 private static class MyXWPFHtmlDocument extends POIXMLDocumentPart {

  private String html;
  private String id;

  private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
   super(part);
   this.html = "<!DOCTYPE html><html><head><style></style><title>HTML import</title></head><body></body></html>";
   this.id = id;
  }

  private String getId() {
   return id;
  }

  private String getHtml() {
   return html;
  }

  private void setHtml(String html) {
   this.html = html;
  }

  @Override
  protected void commit() throws IOException {
   PackagePart part = getPackagePart();
   OutputStream out = part.getOutputStream();
   Writer writer = new OutputStreamWriter(out, "UTF-8");
   writer.write(html);
   writer.close();
   out.close();
  }

 }

 //the XWPFRelation for /word/htmlDoc#.html
 private final static class XWPFHtmlRelation extends POIXMLRelation {
  private XWPFHtmlRelation() {
   super(
    "text/html", 
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk", 
    "/word/htmlDoc#.html");
  }
 }
}
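
The TODO in the wrapper class notes that String methods should not be used for manipulating the HTML. Short of adding a real HTML parser, one way to at least avoid the replace("<body></body>", ...) step is to assemble the complete document from a body fragment. This is a minimal sketch, not part of the original answer; HtmlChunkBuilder, escapeHtml and buildDocument are hypothetical names and do not exist in Apache POI:

```java
public class HtmlChunkBuilder {

    // Escapes the characters that are significant in HTML text content,
    // so plain text (for example a title) cannot break the markup.
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps a body fragment in a complete HTML document.
    // Assumption: bodyFragment is already well-formed HTML and is inserted as-is.
    static String buildDocument(String title, String bodyFragment) {
        return "<!DOCTYPE html><html><head><title>" + escapeHtml(title)
             + "</title></head><body>" + bodyFragment + "</body></html>";
    }

    public static void main(String[] args) {
        System.out.println(buildDocument("HTML import",
            "<p>Simple <b>HTML</b> <i>formatted</i> <u>text</u></p>"));
    }
}
```

The result could then be passed to the setHtml method of MyXWPFHtmlDocument in place of the getHtml().replace(...) call in main.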

result.docx:

Your solution works perfectly. I have another question. As I understand it, altChunks are converted by MS Word when the docx file is opened. Is there any way to do that using Apache POI? I have tried to use docx4j for this purpose: I loaded my docx into a WordprocessingMLPackage, got the MainDocumentPart from it and called the method convertAltChunks(). But when I open the result in LibreOffice, the altChunks are not visible.
@Max: "Is there any way to convert altChunk into native Word document parts using Apache POI?" No, there is not. "I have tried to use docx4j": This is another library. So if you have problems using the docx4j library, please ask a separate question about that.
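
Since the HTML is stored as a separate part /word/htmlDoc#.html in the *.docx ZIP archive, one way to see whether a given file still carries such an unconverted part is to list the ZIP entries. The following sketch uses only java.util.zip; DocxHtmlPartLister, listHtmlParts and sampleArchive are hypothetical names, and the in-memory archive is demo data that only mimics the entry names of a real result.docx:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxHtmlPartLister {

    // Returns the names of all word/*.html entries found in the given ZIP bytes.
    static List<String> listHtmlParts(byte[] docxBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/") && name.endsWith(".html")) {
                    names.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a tiny in-memory ZIP with docx-like entry names (demo data only,
    // not a valid Word document).
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"word/document.xml", "word/htmlDoc1.html"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("x".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(listHtmlParts(sampleArchive())); // prints [word/htmlDoc1.html]
    }
}
```

For a real file, the bytes could be read with java.nio.file.Files.readAllBytes and passed to listHtmlParts. This only reports whether HTML parts are present; it does not convert them.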
