Regex to strip HTML tags

Question

I have this HTML input:

<font size="5"><p>some text</p>
<p> another text</p></font>

I'd like to use regex to remove the HTML tags so that the output is:

some text
another text

Can anyone suggest how to do this with regex?

Don't try to parse HTML with regular expressions. It will only end in tears.
– Jon Skeet, Commented Nov 2, 2010 at 7:44
Please read this answer to a similar question: stackoverflow.com/questions/1732348/…
– Sean Patrick Floyd, Commented Nov 2, 2010 at 7:48
Further reading: stackoverflow.com/questions/832620/stripping-html-tags-in-java
– Andreas Dolk, Commented Nov 2, 2010 at 8:23

aioobe · Accepted Answer · 2017-06-02 21:18:29Z

46

Since you asked, here's a quick and dirty solution:

String stripped = input.replaceAll("<[^>]*>", "");

(Ideone.com demo)

Using regular expressions to process HTML is a pretty bad idea, though. The hack above fails on input such as

<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>

etc.
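
To make these failure cases concrete, here is a small sketch that runs the same replaceAll call on both inputs. The class and method names (RegexStripDemo, strip) are made up for illustration; only String.replaceAll and the pattern itself come from the answer.

```java
class RegexStripDemo {
    // The quick and dirty approach: drop everything from '<' up to the next '>'.
    static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup is handled as expected.
        System.out.println(strip("<p>some text</p>"));                 // some text

        // A '>' inside a quoted attribute value ends the match too early.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>")); // ">Hello

        // A '<' inside script code is mistaken for the start of a tag.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>")); // if (a ');
    }
}
```

In both broken cases the regex has no notion of quoting or of script content, which is exactly what a real parser tracks.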

A better approach is to use an HTML parser such as Jsoup. To remove all tags from a string, you can call Jsoup.parse(html).text().

edited Jun 2, 2017 at 21:18

answered Nov 2, 2010 at 7:45

aioobe


5 Comments

Gumbo Over a year ago

The > is allowed as a literal character in quoted attribute values.

ADIT Over a year ago

Before this tag, my document also has head, title, and similar elements. With the snippet above I get the head and title text as well, but I need only this part of the text. I tried with:

ADIT Over a year ago

private static final Pattern BetweenTags = Pattern.compile("<p>([^<]+?)</p>+");

aioobe Over a year ago

OK, if it were something as simple as stripping tags from uncomplicated HTML, I might have chosen to go with a regexp. In your scenario, I believe you're better off with a proper parser.

BjornS Over a year ago

May I suggest input.replaceAll("<[^>]+>","");

Community · Accepted Answer · 2020-06-20 09:12:55Z

9

Use an HTML parser. Here's a Jsoup example.

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);

Result:

some text another text

Or if you want to preserve newlines:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}

Result:

some text
another text

Jsoup offers further advantages as well. You can easily extract specific parts of the HTML document with the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a good sign in that respect, but if you know the HTML structure in detail beforehand, it is still doable.

1 Comment

Johncl Over a year ago

Just a note that Jsoup does not only strip the HTML tags; it also adds spaces to separate elements. The character count of the result will therefore be greater than that of the HTML text (e.g. as written in a TinyMCE editor), if counting characters is why you need to strip the tags.

Prabhakaran · Accepted Answer · 2010-11-02 14:38:00Z

4

You can use an HTML parser called Jericho HTML Parser.

You can download it here: http://jericho.htmlparser.net/docs/index.html

Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.

The presence of badly formatted HTML does not interfere with the parsing.

answered Nov 2, 2010 at 14:38

Prabhakaran

1 Comment

sproketboy Over a year ago

Jsoup expects well-formed HTML, so it's NOT better than Jericho when you are dealing with arbitrary HTML.

Alexis Dufrenoy · Accepted Answer · 2014-01-06 14:39:40Z

Starting from aioobe's code, I tried something more daring:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);

The code to strip every HTML tag would look like this:

public class HtmlSanitizer {

    private static String pattern;

    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};

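    static {
        // Build one alternation containing the lower- and upper-case form of every tag name.
        StringBuffer tags = new StringBuffer();
        for (int i=0;i<tagsTab.length;i++) {
            tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
            if (i<tagsTab.length-1) {
                tags.append('|');
            }
        }
        // Match an opening, closing, or self-closing tag whose name starts with a listed entry.
        pattern = "</?("+tags.toString()+"){1}.*?/?>";
    }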

    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }

    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);

        System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
    }

}

I wrote this to be Java 1.4 compliant, for some sad reasons, so feel free to use a for-each loop and StringBuilder instead.
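
For readers not bound to Java 1.4, one possible modernization is sketched below. It is not the author's code: the class name ModernHtmlSanitizer is hypothetical, the tag list is shortened to three entries for readability, and it swaps the upper/lower-case duplication for Pattern.CASE_INSENSITIVE, which also matches mixed-case tags such as <Font>. The regex shape is otherwise the same.

```java
import java.util.regex.Pattern;

class ModernHtmlSanitizer {
    // Shortened, illustrative tag list; the full array from the answer could be used instead.
    private static final String[] TAGS = {"font", "p", "br"};

    // Same pattern shape as the original, compiled once and made case-insensitive.
    private static final Pattern TAG_PATTERN = Pattern.compile(
            "</?(" + String.join("|", TAGS) + ").*?/?>",
            Pattern.CASE_INSENSITIVE);

    static String sanitize(String input) {
        return TAG_PATTERN.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>";
        System.out.println(sanitize(html)); // some text another text
    }
}
```

Compiling the Pattern once avoids recompiling the regex on every call, which String.replaceAll would otherwise do.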

Advantages:

You can generate lists of tags you want to strip, which means you can keep those you want
You avoid stripping stuff that isn't an HTML tag
You keep the whitespaces

Drawbacks:

You have to list every HTML tag you want to strip from your string, which can be a lot, for example if you want to strip everything.

If you see any other drawbacks, I would really be glad to know them.
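
Two further drawbacks can be checked directly. The sketch below uses the (font|p) pattern from this answer and compares it with a tightened variant. The variant and the class name TagListPitfalls are hypothetical illustrations, not part of the answer: it adds a word boundary after the tag name and replaces .*? with [^>]* so that a tag may span lines.

```java
class TagListPitfalls {
    // Pattern shape from the answer, limited to font and p.
    static final String ORIGINAL = "</?(font|p){1}.*?/?>";

    // Hypothetical tightened variant: \b stops "p" from matching "pre",
    // [^>]* can cross line breaks, and (?i) ignores case.
    static final String TIGHTENED = "(?i)</?(?:font|p)\\b[^>]*>";

    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Prefix matching: <pre> is not in the list, but "p" matches its first letter.
        System.out.println(strip("<pre>code</pre>", ORIGINAL));   // code
        System.out.println(strip("<pre>code</pre>", TIGHTENED));  // <pre>code</pre>

        // '.' does not match a line break, so a tag spanning two lines survives.
        String multiline = "<p\nclass=\"x\">text</p>";
        System.out.println(strip(multiline, ORIGINAL).startsWith("<p")); // true
        System.out.println(strip(multiline, TIGHTENED));                 // text
    }
}
```

Even the tightened variant still breaks on a > inside a quoted attribute value, so it narrows the problem rather than solving it.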

Fabiano Francesconi · Accepted Answer · 2011-01-02 15:48:39Z

2

If you use Jericho, then you just have to use something like this:

public String extractAllText(String htmlText){
    Source source = new Source(htmlText);
    return source.getTextExtractor().toString();
}

Of course you can do the same even with an Element:

for (Element link : links) {
  System.out.println(link.getTextExtractor().toString());
}

answered Jan 2, 2011 at 15:48

Fabiano Francesconi
