How to convert HTML text to plain text? [duplicate]

Question

Friends, I have to parse a description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?

What are your precise requirements? Do you need to strip HTML tags? Extract the content of a specific tag?
– Vivien Barousse, Commented Aug 31, 2010 at 10:05
I am able to extract the content, but it looks like zcc dsdfsf ddfdfsf sfdfdfdfdf. I am getting my data as shown above, but I need it as simple plain text, without those HTML tags.
– MGSenthil, Commented Aug 31, 2010 at 10:54
There is a similar question with a good answer here: stackoverflow.com/questions/1518675/…. I used Jericho and it works fine.
– рüффп, Commented Sep 3, 2013 at 9:49
Duplicate of stackoverflow.com/q/240546/873282, stackoverflow.com/q/1699313/873282, stackoverflow.com/q/1518675/873282, and stackoverflow.com/q/832620/873282
– koppor, Commented Dec 11, 2016 at 21:45

Ranjit · Accepted Answer · 2019-03-15 09:01:39Z

43

Yes, Jsoup is the better option. Use the following to convert the whole HTML text to plain text:

String plainText = Jsoup.parse(your_html_text).text();

answered Mar 15, 2019 at 9:01

Ranjit

2 Comments

AvahW Over a year ago

To keep the line breaks you can now also use Jsoup.parse(html).wholeText()

vikramvi Over a year ago

This is not working for me, can you please check stackoverflow.com/questions/73861739/…

Sean Patrick Floyd · Accepted Answer · 2010-08-31 10:58:45Z

30

Just getting rid of HTML tags is simple:

// replace all occurrences of one or more HTML tags with optional
// whitespace in between with a single space character
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

Unfortunately, the requirements are rarely that simple: block-level elements such as <div> usually need separate handling, and CDATA blocks containing > characters (e.g. JavaScript) can break the regex.
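To make these limitations concrete, here is a minimal sketch that wraps the regex from this answer in a helper and runs it on a few inputs. The class name TagStripper, the method name strip, and the extra whitespace-collapsing step are illustrative assumptions and not part of the original answer.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Pattern copied from the answer: one tag, optionally followed by more
    // tags that are separated from it only by whitespace.
    private static final Pattern TAG_RUN =
            Pattern.compile("(?s)<[^>]*>(\\s*<[^>]*>)*");

    // Replaces each run of tags with one space, then (as an added assumption)
    // collapses repeated whitespace and trims the result.
    static String strip(String html) {
        String spaced = TAG_RUN.matcher(html).replaceAll(" ");
        return spaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Simple markup works as expected.
        System.out.println(strip("<p>Hello <b>world</b></p>"));
        // prints: Hello world

        // A '>' inside an attribute value ends the "tag" too early.
        System.out.println(strip("<a title=\"a > b\">link</a>"));
        // prints: b">link

        // Script content is not markup, so it leaks into the output.
        System.out.println(strip("<script>if (a > b) x();</script>Done"));
        // prints: if (a > b) x(); Done
    }
}
```

The last two cases are the kind of input where a real HTML parser is the safer choice.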

answered Aug 31, 2010 at 10:58

Sean Patrick Floyd

2 Comments

Erwin Bolwidt Over a year ago

For some background on why this will not work for the general case, and won't be f(u|oo)l-proof: RegEx match open tags except XHTML self-contained tags

George Over a year ago

Love it... so simple, yet so powerful

demongolem · Accepted Answer · 2012-04-23 03:57:13Z

10

You can use this single line to remove the HTML tags and get the plain text:

htmlString = htmlString.replaceAll("\\<.*?\\>", "");
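One detail worth checking with this pattern: in Java, `.` does not match line terminators by default, so a tag that spans several lines is left in place unless the DOTALL flag `(?s)` is added. The sketch below illustrates this; the class and method names are hypothetical, and the `(?s)` variant is a suggested adjustment rather than part of the answer.

```java
class SingleLineRegexDemo {
    // The one-liner from the answer (the backslashes before < and > are optional in Java).
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Same pattern with DOTALL enabled so '.' also matches newlines.
    static String stripTagsDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String multiLineTag = "<a\nhref=\"x\">link</a>";

        // The opening tag contains a newline, so only </a> is removed.
        System.out.println(stripTags(multiLineTag));

        // With (?s) both tags are removed.
        System.out.println(stripTagsDotAll(multiLineTag));
        // prints: link
    }
}
```

Neither variant decodes entities: a string such as `a &amp; b` comes back unchanged.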

edited Apr 23, 2012 at 3:57

demongolem

answered Sep 3, 2010 at 10:16

Kandha

xxx · Accepted Answer · 2021-01-05 05:45:10Z

8

Use Jsoup.

Add the dependency

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

Now, in your Java code:

public static String html2text(String html) {
    return Jsoup.parse(html).wholeText();
}

Call html2text with the HTML string and it returns the plain text.

answered Jan 5, 2021 at 5:45

xxx

Community · Accepted Answer · 2017-05-23 12:18:10Z

5

Use an HTML parser such as HtmlCleaner.

For a detailed answer, see: How to remove HTML tag in Java

edited May 23, 2017 at 12:18

CommunityBot

answered Aug 31, 2010 at 10:06

ankitjaininfo

mtb · Accepted Answer · 2016-11-14 14:41:55Z

2

If you want the text rendered the way a browser would display it, use Jericho:

import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        String sourceUrlString = "data/test.html";
        if (args.length == 0)
            System.err.println("Using default argument of \"" + sourceUrlString + '"');
        else
            sourceUrlString = args[0];
        // Treat an argument without a scheme as a local file path.
        if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));
        // The renderer produces a plain-text version of the document.
        String renderedText = source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
    }
}

I hope this also helps with rendering tables the way a browser would.

Thanks, Ganesh

edited Nov 14, 2016 at 14:41

mtb

answered Nov 14, 2016 at 12:34

Ganesan Palanisamy

1 Comment

koppor Over a year ago

Can the downvoters please explain why they downvote?

Jon Freedman · Accepted Answer · 2010-08-31 10:07:22Z

1

I'd recommend running the raw HTML through JTidy, which should give you well-formed output that you can write XPath expressions against. This is the most robust way I've found of scraping HTML.

answered Aug 31, 2010 at 10:07

Jon Freedman

Stephen Rauch · Accepted Answer · 2018-10-04 01:35:11Z

0

I needed a plain-text representation of some HTML that included FreeMarker tags. The problem was handed to me with a Jsoup solution, but Jsoup was escaping the FreeMarker tags, which broke the functionality. I also tried HtmlCleaner (SourceForge), but that left the HTML header and style content in the output (with the tags removed). See http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726

My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

The maxLineLength ensures lines are not artificially wrapped at 80 characters. The setNewLine(null) uses the same new line character(s) as the source.

edited Oct 4, 2018 at 1:35

Stephen Rauch♦

answered Oct 4, 2018 at 1:04

John Camerin

Ruslanas · Accepted Answer · 2020-05-20 10:04:15Z

0

I use HTMLUtil.textFromHTML(value) from the javautil library:

<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>

answered May 20, 2020 at 10:04

Ruslanas

Akshay More · Accepted Answer · 2021-01-12 21:25:53Z

0

Using Jsoup, I got all the text on a single line.

So I used the following block of code to parse HTML and keep new lines:

private String parseHTMLContent(String toString) {
    String result = toString.replaceAll("\\<.*?\\>", "\n");
    String previousResult = "";
    while(!previousResult.equals(result)){
        previousResult = result;
        result = result.replaceAll("\n\n","\n");
    }
    return result;
}

Not the best solution but solved my problem :)
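For comparison, the loop above can probably be collapsed into a single extra replaceAll, since `\n+` matches any run of newlines in one pass. This is a minimal sketch under that assumption; the class name KeepLineBreaks and both method names are hypothetical, and the first method is a copy of the answer's logic included only to check that the two agree.

```java
class KeepLineBreaks {
    // The approach from the answer: tags become newlines, then pairs of
    // newlines are merged repeatedly until nothing changes.
    static String withLoop(String html) {
        String result = html.replaceAll("<.*?>", "\n");
        String previous = "";
        while (!previous.equals(result)) {
            previous = result;
            result = result.replaceAll("\n\n", "\n");
        }
        return result;
    }

    // Compact variant: collapse each run of newlines in a single pass.
    static String compact(String html) {
        return html.replaceAll("<.*?>", "\n").replaceAll("\n+", "\n");
    }

    public static void main(String[] args) {
        String html = "<p>one</p><p>two</p>";
        // Both variants print "one" and "two" on separate lines,
        // with one leading and one trailing newline.
        System.out.print(withLoop(html));
        System.out.print(compact(html));
    }
}
```

Like the original, this keeps a leading and a trailing newline; calling trim() on the result removes them if they are unwanted.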

answered Jan 12, 2021 at 21:25

Akshay More
