I'm looking to use R to clean up some text strings from a database. The database stores the text complete with HTML tags. Unfortunately, due to database limitations, each string is broken into multiple fragments, one per record. I think I can figure out how to remove the HTML tags with regular expressions and the help of other posts, but I don't expect those solutions to work unless I first concatenate the fragments back together, because an opening or closing HTML tag can be split across rows of the dataframe. Here is some sample data:
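To illustrate why the order of operations matters, here is a minimal sketch. The two fragments and the `strip_tags` helper are hypothetical (they are not part of the sample data), and the pattern is a naive tag-stripping regex, not a full HTML parser.

```r
# Hypothetical fragments: the closing tag </b> is split across two rows
frags <- c("The <b>quick</", "b> brown fox")

# Naive tag-stripping helper (an assumption, not a real HTML parser)
strip_tags <- function(x) gsub("<[^>]*>", "", x)

# Stripping each fragment on its own leaves the split tag behind
per_fragment <- paste(strip_tags(frags), collapse = " ")

# Joining the fragments first lets the regex see the whole tag
joined_first <- strip_tags(paste(frags, collapse = " "))

print(per_fragment)
print(joined_first)
```

Only the second result is free of tag debris, which is why the fragments need to be concatenated before any tag removal.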
Existing dataframe:
Record_nbr  fragment  Comments
1           1         "The quick brown"
1           2         "fox jumped over"
1           3         "the lazy dog."
2           1         "New Record."
Desired output dataframe:
Record_nbr  fragment  Comments
1           3         "The quick brown fox jumped over the lazy dog."
2           1         "New Record."
Data:
dat <- read.table(text='Record_nbr fragment Comments
1 1 "The quick brown"
1 2 "fox jumped over"
1 3 "the lazy dog."
2 1 "New Record."', header = TRUE, stringsAsFactors = FALSE)
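For reference, one possible way to get from the existing dataframe to the desired one in base R is sketched below. It assumes that fragments should be joined with a single space in `fragment` order, and that the `fragment` column of the result should hold the highest fragment number for each record. The names `comments`, `fragments`, and `out` are illustrative only.

```r
# Rebuild the sample data so the example is self-contained
dat <- read.table(text = 'Record_nbr fragment Comments
1 1 "The quick brown"
1 2 "fox jumped over"
1 3 "the lazy dog."
2 1 "New Record."', header = TRUE, stringsAsFactors = FALSE)

# Sort so fragments are pasted in the right order within each record
dat <- dat[order(dat$Record_nbr, dat$fragment), ]

# Collapse the Comments of each record into one string
# (assumption: fragments are separated by a single space)
comments <- aggregate(Comments ~ Record_nbr, data = dat,
                      FUN = paste, collapse = " ")

# Keep the highest fragment number per record
# (assumption: this is what the fragment column should contain)
fragments <- aggregate(fragment ~ Record_nbr, data = dat, FUN = max)

# Combine into one row per record: Record_nbr, fragment, Comments
out <- merge(fragments, comments, by = "Record_nbr")
print(out)
```

Any HTML tag removal would then be applied to `out$Comments`, after the fragments have been rejoined.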