I have a text string like the one below:
^style>
p,span,li{font-family:Arial;font-size:10.5pt;}
^/style>
^p>
^img src="https://app.keysurvey.com/" alt="image" width="462" />
^/p>
^p>
Dear Adam,
^/p>
^p>
Thank you for your query, the Reference ID for your query is
^strong>^u> 28600 ^/u>^/strong>
. We will respond to you within the next 1-2 business days.
^/p>
^p>For further correspondence with us, kindly reply by maintaining the
Reference ID number of this case in the subject line of your e-mail.
^/p>
^p>
Regards
^/p>
My goal is to remove all HTML tags and other junk values and return text like this:
Output:
Dear Adam,
Thank you for your query, the Reference ID for your query is We will respond to you within the next 1-2 business days.For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.Regards,
I have tried extractHTMLStrip from tm.plugin.webmining; however, it could not clear the junk values:
library(tm.plugin.webmining)
df$text1 <- extractHTMLStrip(df$text)
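Use gsub("\\^p", "", x) and then repeat that for anything else you want to remove. This replaces every instance of ^p with nothing. Note that the caret must be escaped: the pattern "[^p]" is a negated character class and would delete every character except p.

Alternatively, turn the carets back into angle brackets with gsub("\\^", "<", df$text), which should make your HTML tools work.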
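The two steps above (restore the angle brackets, then strip the markup) can be sketched in base R as follows. This is only a minimal sketch: the sample string x is a hypothetical, shortened version of the text in the question, and the regular expressions assume simple, well-formed tags rather than arbitrary HTML. For messier input, a real HTML parser is probably the safer choice.

```r
# Hypothetical sample, using the same caret-for-bracket form as the question
x <- paste0(
  "^style>p,span,li{font-family:Arial;font-size:10.5pt;}^/style>",
  "^p>Dear Adam,^/p>",
  "^p>Thank you for your query, the Reference ID for your query is ",
  "^strong>^u> 28600 ^/u>^/strong>.^/p>"
)

# Step 1: restore the opening angle brackets (fixed = TRUE treats "^" literally)
html <- gsub("^", "<", x, fixed = TRUE)

# Step 2: drop whole <style>...</style> blocks so the CSS rules do not survive
# (?s) lets "." match newlines; ".*?" is a non-greedy match
no_style <- gsub("(?s)<style>.*?</style>", "", html, perl = TRUE)

# Step 3: replace every remaining tag with a space
text <- gsub("<[^>]+>", " ", no_style)

# Step 4: collapse runs of whitespace and trim the ends
text <- trimws(gsub("\\s+", " ", text))

print(text)
```

To apply the same steps to a data frame column such as df$text, wrap them in a function and call it on the column; gsub is vectorised, so no explicit loop should be needed.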