Remove HTML tags from a text string and keep the text

Question

I have a text string like the one below:

^style>           
  p,span,li{font-family:Arial;font-size:10.5pt;}        
^/style>  
^p>
  ^img src="https://app.keysurvey.com/" alt="image" width="462" />
^/p>  
^p>
  Dear Adam,
^/p>  
^p>
  Thank you for your query, the Reference ID for your query is 
  ^strong>^u> 28600 ^/u>^/strong>
  .&nbsp; We will respond to you within the next 1-2 business days.
^/p>  
^p>For further correspondence with us, kindly reply by maintaining the 
   Reference ID number of this case in the subject line of your e-mail.
^/p>  
^p>
  Regards
^/p>

My goal is to remove all HTML tags and other junk values and return text like this:

Output:

Dear Adam,

Thank you for your query, the Reference ID for your query is We will respond to you within the next 1-2 business days.For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.Regards,

I have tried extractHTMLStrip from tm.plugin.webmining; however, it could not clear the junk values.

library(tm.plugin.webmining)
df$text1 <- extractHTMLStrip(df$text)

This has been asked and answered many times. Multiple solutions are available through various libraries or via regex. Try, e.g., here or here.
– Radim, Commented Jan 17, 2019 at 4:57
These don't help, but thank you anyway. The second link points to my own question. I have tried the XML, RCurl and rvest libraries to clear the junk values, but they don't help either. Thanks and have a good day.
– Dwaipayan Dutta, Commented Jan 17, 2019 at 5:26
You can try gsub("\\^p>", "", x) and then repeat that for anything else you want to remove. This replaces every instance of ^p> with nothing (the caret must be escaped, since an unescaped "[^p]" matches every character except p).
– morgan121, Commented Jan 17, 2019 at 5:53
Sorry, I must have messed up the copy and paste of the links. I provided an answer using regular expressions below, but if it is a case of corrupted strings, you can do gsub("\\^", "<", df$text), which should make your HTML tools work.
– Radim, Commented Jan 17, 2019 at 6:20

Radim · Accepted Answer · 2019-01-17 06:16:37Z

If the less-than signs in your string have been corrupted into carets (^), you can clean it up with regular expressions.

yourstring <- '^style> p,span,li{ font-family:Arial; font-size:10.5pt; } ^/style> ^p>^img src="https://app.keysurvey.com/" alt="image" width="462" />^/p> ^p>Dear Adam,^/p> ^p>Thank you for your query, the Reference ID for your query is ^strong>^u> 28600 ^/u>^/strong>.  We will respond to you within the next 1-2 business days.^/p> ^p>For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.^/p> ^p>Regards'
# reproducible example of your string

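# remove every caret-prefixed tag, e.g. ^p>, ^/p>, ^img ... />
yourstring <- gsub("\\^.*?>", "", yourstring)
# remove the CSS rule left over from the style block
yourstring <- gsub("p,span.*?}", "", yourstring)
# strip leading and trailing whitespace
yourstring <- trimws(yourstring)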

This gets you:

> yourstring
[1] "Dear Adam, Thank you for your query, the Reference ID for your query is  28600 .  We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards"
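
Since the question works on a data frame column (df$text) rather than a single string, the same steps can be wrapped in a function and applied to the whole column, because gsub is vectorized. The following is only a sketch under assumptions: the helper name clean_caret_html, the &nbsp; replacement, and the whitespace-collapsing step are additions that are not part of the answer above, and the sample data frame is made up for illustration.

```r
# Hypothetical helper (not from the original answer): applies the same
# regex steps to a character vector and tidies up the whitespace.
clean_caret_html <- function(x) {
  x <- gsub("\\^.*?>", "", x)                 # drop caret-prefixed tags such as ^p> or ^/strong>
  x <- gsub("p,span.*?\\}", "", x)            # drop the leftover CSS rule from the style block
  x <- gsub("&nbsp;", " ", x, fixed = TRUE)   # turn the HTML entity into a plain space
  x <- gsub("\\s+", " ", x)                   # collapse runs of whitespace into one space
  trimws(x)                                   # strip leading and trailing whitespace
}

# Made-up sample data in the same corrupted format as the question
df <- data.frame(
  text = c(
    "^p>Dear Adam,^/p> ^p>Reference ID ^strong>^u> 28600 ^/u>^/strong>.&nbsp; Thanks^/p>",
    "^style> p,span,li{font-size:10.5pt;} ^/style> ^p>Regards^/p>"
  ),
  stringsAsFactors = FALSE
)

df$text1 <- clean_caret_html(df$text)
df$text1
# [1] "Dear Adam, Reference ID 28600 . Thanks" "Regards"
```

Note that the second pattern is tied to this particular style rule (it starts at "p,span"), so a different stylesheet would need a different pattern. The first pattern also removes everything between any literal caret in the message body and the next ">", so it is only safe when carets appear solely as corrupted tag openers.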

To make it more elegant, you can use the stringr and magrittr libraries.
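
A piped version might look roughly like the sketch below. It assumes stringr (1.3.0 or later, for str_squish) and magrittr are installed. The shortened sample string and the use of str_squish in place of trimws are choices made here for illustration, not something the answer specifies.

```r
library(stringr)
library(magrittr)

# Shortened sample in the same corrupted format (illustrative only)
yourstring <- '^style> p,span,li{ font-size:10.5pt; } ^/style> ^p>Dear Adam,^/p> ^p>Regards^/p>'

cleaned <- yourstring %>%
  str_remove_all("\\^.*?>") %>%        # drop caret-prefixed tags
  str_remove_all("p,span.*?\\}") %>%   # drop the leftover CSS rule
  str_squish()                         # trim ends and collapse inner whitespace

cleaned
# [1] "Dear Adam, Regards"
```

Unlike trimws, str_squish also collapses repeated spaces inside the string, which removes the double spaces left behind where tags used to be.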
