I have a text string like the one below:

^style>           
  p,span,li{font-family:Arial;font-size:10.5pt;}        
^/style>  
^p>
  ^img src="https://app.keysurvey.com/" alt="image" width="462" />
^/p>  
^p>
  Dear Adam,
^/p>  
^p>
  Thank you for your query, the Reference ID for your query is 
  ^strong>^u> 28600 ^/u>^/strong>
  .  We will respond to you within the next 1-2 business days.
^/p>  
^p>For further correspondence with us, kindly reply by maintaining the 
   Reference ID number of this case in the subject line of your e-mail.
^/p>  
^p>
  Regards
^/p>

My goal is to remove all HTML tags and other junk values and return text like this:

Output:

Dear Adam,

Thank you for your query, the Reference ID for your query is We will respond to you within the next 1-2 business days.For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.Regards,

I have tried extractHTMLStrip from the tm.plugin.webmining package; however, it could not clear the junk values:

library(tm.plugin.webmining)
df$text1 <- extractHTMLStrip(df$text)
  • This has been asked and answered many times. Multiple solutions are available through various libraries or via regex. Try e.g. here or here. Commented Jan 17, 2019 at 4:57
  • These don't help, but thank you anyway. The second link points to my own question. I have tried the XML, RCurl, and rvest libraries to clear the junk values, but they don't help. Thanks, and have a good day. Commented Jan 17, 2019 at 5:26
  • You can try gsub("\\^p", "", x) and then repeat that for anything else you want to remove. This replaces every instance of ^p with nothing. (The caret has to be escaped; "[^p]" would instead match every character except p.) Commented Jan 17, 2019 at 5:53
  • Sorry, I must have messed up the copy and paste of the links. I provided an answer using regular expressions below, but if it is a case of corrupted strings, you can do gsub("\\^", "<", df$text), which should make your HTML tools work. Commented Jan 17, 2019 at 6:20

1 Answer


If the less-than signs in your string have been corrupted into carets (^), you can clean it up with regular expressions.

yourstring <- '^style> p,span,li{ font-family:Arial; font-size:10.5pt; } ^/style> ^p>^img src="https://app.keysurvey.com/" alt="image" width="462" />^/p> ^p>Dear Adam,^/p> ^p>Thank you for your query, the Reference ID for your query is ^strong>^u> 28600 ^/u>^/strong>.  We will respond to you within the next 1-2 business days.^/p> ^p>For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.^/p> ^p>Regards'
# reproducible example of your string

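# Remove every corrupted tag: a caret, then the shortest run of characters up to ">"
yourstring <- gsub("\\^.*?>", "", yourstring)
# Remove the CSS rule left behind by the style block
yourstring <- gsub("p,span.*?}", "", yourstring)
# Trim leading and trailing whitespace
yourstring <- trimws(yourstring)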

This gets you:

> yourstring
[1] "Dear Adam, Thank you for your query, the Reference ID for your query is  28600 .  We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards"
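
The same steps can be wrapped in a small vectorized helper so that they apply to a whole column such as df$text. This is only a sketch under stated assumptions: the name strip_caret_html is hypothetical, every tag is assumed to start with ^ and end with >, and the text itself is assumed to contain no other ^ characters. Unlike the code above, it removes the whole ^style> ... ^/style> block rather than matching one specific CSS rule, and it replaces tags with a space so that adjacent paragraphs are not glued together.

```r
# Hypothetical helper (not from any package): strips caret-corrupted HTML tags.
strip_caret_html <- function(x) {
  # Drop the whole style block, including the CSS inside it.
  # (?s) lets "." match newlines under perl = TRUE.
  x <- gsub("(?s)\\^style>.*?\\^/style>", " ", x, perl = TRUE)
  # Replace every remaining tag (a caret up to the next ">") with a space.
  x <- gsub("\\^[^>]*>", " ", x)
  # Collapse runs of whitespace into one space, then trim both ends.
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

strip_caret_html("^style> p{color:red;} ^/style>^p>Dear Adam,^/p> ^p>Regards^/p>")
```

Because gsub and trimws are vectorized, something like df$text1 <- strip_caret_html(df$text) should work on the data frame from the question.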

To make this more elegant, you can use the stringr and magrittr libraries.
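
A piped version might look like the sketch below. It assumes stringr (1.3.0 or later, for str_remove_all and str_squish) and magrittr are installed, and it uses a shortened, made-up sample string rather than the full e-mail. It also swaps the CSS-specific pattern for one that removes the entire style block, which is a design choice of this sketch, not part of the original answer.

```r
library(stringr)
library(magrittr)

# Shortened, illustrative sample with the same caret-corrupted tags.
yourstring <- "^style> p,span,li{ font-family:Arial; } ^/style> ^p>Dear Adam,^/p> ^p>Regards^/p>"

clean <- yourstring %>%
  # Drop the style block and its CSS; dotall = TRUE lets "." span newlines.
  str_remove_all(regex("\\^style>.*?\\^/style>", dotall = TRUE)) %>%
  # Replace each remaining tag with a space.
  str_replace_all("\\^[^>]*>", " ") %>%
  # Collapse repeated whitespace and trim both ends.
  str_squish()

clean
```

str_squish handles both the trimming and the collapsing of doubled spaces that the tag removal leaves behind.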
