I'm trying to write a custom HTML-to-LaTeX converter that uses WordPress posts as its source.

Basically, it needs to do some "replacing", turning this:

   <h2>H2 Title</h2>
   <p>Text text text</p>
   <img src="/image.png" alt="Image ALT tag" />

into this:

   \begin{document}

   \section{H2 Title}

   Text text text

   \shorthandoff{=}
   \begin{figure}[H]
   \centering
   \includegraphics[scale=0.7]{./img/image.png}
   \caption{Image ALT tag}
   \end{figure}
   \shorthandon{=}

   \end{document}

Which approach should I use? Is there an HTML DOM parser that allows replacements like this? Or do you have other suggestions?

Update: Is there a proper way to walk the HTML DOM tree in PHP? I tried RecursiveDOMIterator (http://stackoverflow.com/questions/4431142/loop-through-all-elements-of-body-tags-using-dom), but I couldn't get it to work.

Thanks.

2 Answers

Have you tried PHP Simple HTML DOM Parser? Specifically, the "How to traverse the DOM tree?" section in the manual might be what you are looking for.
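
To make the traversal idea concrete, here is a minimal sketch in Python rather than PHP, using only the standard library's `html.parser`. The names `Node`, `TreeBuilder`, `to_latex`, and `html_to_latex` are made up for this illustration, and the `./img` path prefix and figure options are simply copied from the example in the question. It does not escape LaTeX special characters, so treat it as a starting point and not a finished converter. The same recursive walk should carry over to whichever DOM library you end up using in PHP.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr"}  # elements that never have a closing tag


class Node:
    """A minimal DOM-like node: tag name, attributes, and children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML with the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.stack[-1].children.append(data)


def to_latex(node):
    """Recursively convert a node (or a text string) to LaTeX."""
    if isinstance(node, str):
        return node
    inner = "".join(to_latex(child) for child in node.children)
    if node.tag == "h2":
        return "\\section{" + inner.strip() + "}\n\n"
    if node.tag == "p":
        return inner.strip() + "\n\n"
    if node.tag == "img":
        # Assumption: src starts with "/" and images live under ./img
        src = node.attrs.get("src", "")
        alt = node.attrs.get("alt", "")
        return (
            "\\shorthandoff{=}\n"
            "\\begin{figure}[H]\n"
            "\\centering\n"
            "\\includegraphics[scale=0.7]{./img" + src + "}\n"
            "\\caption{" + alt + "}\n"
            "\\end{figure}\n"
            "\\shorthandon{=}\n\n"
        )
    return inner  # tags without a rule: keep only their converted content


def html_to_latex(html):
    """Parse an HTML fragment and wrap the converted body in a document."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return "\\begin{document}\n\n" + to_latex(builder.root) + "\\end{document}\n"
```

Calling `html_to_latex` on the three-line snippet from the question should produce the `\section`, paragraph, and `figure` blocks in document order. Tags without a rule, such as `<b>`, simply contribute their text.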

Depending on how complicated the structure of the HTML in your posts is, you could use regular expression-based replacements (if the markup is fairly simple, as in your example). If you want to replicate complex structures (nested elements) into LaTeX, then regex likely wouldn't work.
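
For the simple case, a rough sketch of what such replacements could look like (in Python, with the hypothetical names `RULES`, `img_to_figure`, and `simple_html_to_latex`; the figure layout is copied from the question). It assumes tags carry no extra attributes and that `src` always comes before `alt` in an `<img>`, which is exactly the kind of assumption that breaks on real posts.

```python
import re


def img_to_figure(match):
    """Build the figure block from the captured src and alt values."""
    return (
        "\\shorthandoff{=}\n"
        "\\begin{figure}[H]\n"
        "\\centering\n"
        "\\includegraphics[scale=0.7]{./img" + match.group(1) + "}\n"
        "\\caption{" + match.group(2) + "}\n"
        "\\end{figure}\n"
        "\\shorthandon{=}"
    )


# (pattern, replacement function) pairs, applied in order.
RULES = [
    (re.compile(r"<h2>(.*?)</h2>", re.S),
     lambda m: "\\section{" + m.group(1) + "}"),
    (re.compile(r"<p>(.*?)</p>", re.S),
     lambda m: m.group(1)),
    # Assumption: attributes appear as src then alt, double-quoted.
    (re.compile(r'<img\s+src="([^"]*)"\s+alt="([^"]*)"\s*/?>'),
     img_to_figure),
]


def simple_html_to_latex(html):
    """Apply each regex rule in turn to flat, attribute-free markup."""
    for pattern, repl in RULES:
        html = pattern.sub(repl, html)
    return html
```

Anything that does not match a rule exactly, for example `<h2 class="x">`, passes through unchanged, so this only holds up while the markup stays as plain as in your example.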

1 Comment

Even if it is possible to parse the subset of HTML necessary for Hazar's task using regular expressions, it is still not advisable. This would quickly become unwieldy when dealing with attributes and would not give the tree-like data structure necessary to construct the LaTeX document.
