I'm looking to use R to clean up some text strings from a database. The database stores the text complete with HTML tags. Unfortunately, due to database limitations, each string is broken into multiple fragments, one per row. I think I could figure out how to remove the HTML tags with regular expressions and the help of other posts, but I don't expect those solutions to work unless I first concatenate the fragments back together, because opening and closing HTML tags can be spread across records in the dataframe. Here is some sample data:

Existing dataframe

Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."

Desired output dataframe

Record_nbr  fragment    Comments
1   3   "The quick brown fox jumped over the lazy dog."
2   1   "New Record."

Data:

dat <- read.table(text='Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."', header=TRUE)

5 Answers

I am assuming that you didn't actually want to keep the fragment column. In this case you can use this quick one-liner:

aggregate(Comments ~ Record_nbr, data=dat, function(x) paste(x, collapse=" "))
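
Since the question mentions tags that can span fragments, here is a minimal sketch of the follow-up step. It assumes that the rows may arrive out of order (so they are sorted by fragment first) and that a simple regular expression is good enough for the markup; it is not a full HTML parser. The `html_dat` data frame and its contents are made up for illustration.

```r
# Hypothetical sample data: a <b>...</b> pair is split across two fragments,
# and the rows are deliberately out of order
html_dat <- data.frame(
  Record_nbr = c(1, 1, 1, 2),
  fragment   = c(2, 1, 3, 1),
  Comments   = c("fox</b> jumped over", "The <b>quick brown",
                 "the lazy dog.", "<p>New Record.</p>"),
  stringsAsFactors = FALSE
)

# aggregate() pastes rows in the order they appear within each group,
# so sort by record and fragment before collapsing
html_dat <- html_dat[order(html_dat$Record_nbr, html_dat$fragment), ]

joined <- aggregate(Comments ~ Record_nbr, data = html_dat,
                    FUN = function(x) paste(x, collapse = " "))

# Naive tag removal (assumption: no ">" characters inside attribute values)
joined$Comments <- gsub("<[^>]+>", "", joined$Comments)
```

Stripping the tags only after the fragments are joined is what keeps a tag whose opening and closing halves sit in different rows from being missed.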

The fragment column seems to lose its meaning once the fragments are combined, so it can probably be dropped. Maybe:

> aggregate(dat[3], dat[1], paste, collapse = " ")
#   Record_nbr                                      Comments
# 1          1 The quick brown fox jumped over the lazy dog.
# 2          2                                   New Record.

equivalent to

aggregate(Comments ~ Record_nbr, data = dat, paste, collapse = " ")

2 Comments

grouped <- aggregate(dataframe[[12]], dataframe[1:9],paste, collapse = " ")
like aggregate(dat[-1], dat[1], paste) for this example

Here's one of many approaches:

## ensure order
dat <- with(dat, dat[order(Record_nbr, fragment), ])

do.call(rbind, lapply(split(dat, dat$Record_nbr), function(x) {
    data.frame(
        x[1, 1, drop=FALSE], 
        fragment = max(x[, 2]), 
        Comments = paste(x$Comments, collapse=" ")
    )
}))

##   Record_nbr fragment                                      Comments
## 1          1        3 The quick brown fox jumped over the lazy dog.
## 2          2        1                                   New Record.

Using dplyr:

library(dplyr)
dat %>% 
group_by(Record_nbr) %>% 
summarize(fragment= n(), Comments=paste(Comments, collapse= " "))

#  Record_nbr fragment                                      Comments
#1          1        3 The quick brown fox jumped over the lazy dog.
#2          2        1                                   New Record.

Also consider using the quicker 'aggregate' function:

aggregate(dat,  by=list(dat$Record_nbr), paste, collapse=" ")

##   Group.1 Record_nbr fragment                                      Comments
## 1       1      1 1 1    1 2 3 The quick brown fox jumped over the lazy dog.
## 2       2          2        1                                   New Record.

Edit: You might have to play with the function inputs to get the exact outcome you want.
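
One way to adjust the inputs, as a rough sketch rather than the only option: aggregate each column with its own function and merge the two results. This assumes the fragment column should hold the highest fragment number per record; the names `comments`, `fragments` and `result` are just illustrative.

```r
dat <- read.table(text='Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."', header=TRUE, stringsAsFactors=FALSE)

# Collapse the text of each record into a single string
comments <- aggregate(Comments ~ Record_nbr, data = dat,
                      FUN = paste, collapse = " ")

# Keep the highest fragment number seen for each record
fragments <- aggregate(fragment ~ Record_nbr, data = dat, FUN = max)

# Join the two summaries back together on the record id
result <- merge(fragments, comments, by = "Record_nbr")
```

Here `result` has one row per Record_nbr, with the fragment and Comments columns side by side.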
