I want to transform valid HTML with a shallow level of nesting into HTML that follows a more restricted set of rules.

Only the following tags are supported in the resulting HTML:

<b></b>, <strong></strong>, <i></i>, <em></em>, <a href="URL"></a>, <code></code>, <pre></pre>

Nested tags are not allowed at all.

For all other tags and their combinations, I have to define rules for how to handle each one. So I have to convert something like:

<p>text</p> into the plain string text followed by a line break,

<b>text <a href="url">link</a> text</b> into text link text

<a href="url">text<code> code here</code></a> into <a href="url">text code here</a> because <code> is nested inside <a> and so on.

For example, this HTML (line breaks are only for readability):

<p>long paragraph <a href="url">link</a> </p>
<p>another text <pre><code>my code block</code></pre> the rest of description</p>
<p><code>inline monospaced text with <a href="url">link</a></code></p>

should be transformed into:

long paragraph <a href="url">link</a>

another text <code>my code block</code> the rest of description

<code>inline monospaced text with link</code>

Any suggestions on how to solve this?

1 Answer

After some investigation, I found what I think is a pretty elegant solution. It is based on the tagsoup library, whose Text.HTML.TagSoup.Tree module parses HTML into a tree structure.

The module also provides the transformTree function, which makes the transformation almost trivial. The documentation for that function says:

This operation is based on the Uniplate transform function. Given a list of trees, it applies the function to every tree in a bottom-up manner.

You can read more about Uniplate in its documentation.
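To make the bottom-up behaviour concrete, here is a minimal sketch that works on a hand-written tree. The names unwrapSpan, sample and unwrapped are made up for this illustration and are not part of tagsoup; only TagTree, TagText and transformTree come from the library.

```haskell
import Text.HTML.TagSoup (Tag (TagText))
import Text.HTML.TagSoup.Tree (TagTree (..), transformTree)

-- Hypothetical rule: replace every <span> branch with its children.
-- By the time a branch is visited, its children have already been transformed.
unwrapSpan :: TagTree String -> [TagTree String]
unwrapSpan (TagBranch "span" _ children) = children
unwrapSpan other = [other]

-- <b><span>bold <span>nested</span></span> text</b>, written out as a tree.
sample :: [TagTree String]
sample =
  [ TagBranch "b" []
      [ TagBranch "span" []
          [ TagLeaf (TagText "bold ")
          , TagBranch "span" [] [TagLeaf (TagText "nested")]
          ]
      , TagLeaf (TagText " text")
      ]
  ]

-- The inner <span> is unwrapped first, then the outer one, leaving a <b>
-- branch that contains only the three text leaves.
unwrapped :: [TagTree String]
unwrapped = transformTree unwrapSpan sample
```

Because the rule returns a list, a single node can be dropped (empty list), kept (singleton) or replaced by several nodes, which is what makes rules like "replace a tag with its content" easy to express.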

This is the code I was satisfied with:

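import Text.HTML.TagSoup
import Text.HTML.TagSoup.Tree

-- Rewrite the parsed tree bottom-up so that only flat, allowed tags remain.
convert :: [TagTree String] -> [TagTree String]
convert = transformTree f
    where
      f (TagLeaf (TagOpen "br" _)) = [TagLeaf (TagText "\n")] -- line breaks
      f (TagLeaf (TagOpen _ _)) = [] -- drop all other tags without a closing pair
      f (TagBranch "a" attrs inner) = tagExtr "a" attrs inner -- <a> keeps its attributes (href)
      f (TagBranch "p" _ inner) = inner ++ [TagLeaf (TagText "\n")] -- paragraph -> content + line break
      f (TagBranch "pre" _ [TagBranch "code" _ inner]) = tagExtr "code" [] inner -- <pre><code> -> <code>
      f (TagBranch tag _ inner) = if tag `elem` allowedTags then tagExtr tag [] inner else inner
      f x = [x]

-- Wrap the plain text of `inner` in a single tag, discarding any nested tags.
tagExtr :: String -> [Attribute String] -> [TagTree String] -> [TagTree String]
tagExtr tag attrs inner = [TagBranch tag attrs [extractFrom inner]]

allowedTags :: [String]
allowedTags = ["b", "strong", "i", "em", "a", "code", "pre"]

-- Collapse a list of trees into one text leaf.
extractFrom :: [TagTree String] -> TagTree String
extractFrom x = TagLeaf $ TagText $ innerText (flattenTree x)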
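The function above works on parsed trees, so it still needs to be combined with parsing and rendering. A minimal sketch of that glue, assuming String input: runOnHtml is a hypothetical helper name, while parseTree and renderTree are the functions from Text.HTML.TagSoup.Tree. The conversion is passed in as an argument so that the snippet compiles on its own.

```haskell
import Text.HTML.TagSoup.Tree (TagTree, parseTree, renderTree)

-- Hypothetical helper: parse an HTML string, apply a tree-to-tree
-- conversion, and render the result back to a string.
runOnHtml :: ([TagTree String] -> [TagTree String]) -> String -> String
runOnHtml conv = renderTree . conv . parseTree
```

With the answer's definitions in scope, something like putStr (runOnHtml convert "<p>long paragraph <a href=\"url\">link</a> </p>") should print the converted markup. Note that renderTree escapes special characters in text nodes, so the output may differ from the input in more than just the tags.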