I have a main data frame containing many websites that I'm working with, and another data frame containing a list of bad websites. I want to check whether any of the bad websites appear in my main data frame. Since I'm very new to this, I'm not sure how to match the bad websites and replace them with "www.badwebsite.com". Thanks.

Here is an example of the data frames:

site_list <- data.frame("host" = c("www.companya.com", "www.companyb.com", "www.malwaresite.com",
                                   "www.companyc.com", "www.companyd.com", "www.virussite.com",
                                   "www.companye.com", "www.companyf.com", "www.phishingsite.com"),
                        "URL" = c("www.companya.com/home", "www.companyb.com/home", "www.malwaresite.com/home",
                                  "www.companyc.com/home", "www.companyd.com/home", "www.virussite.com/home",
                                  "www.companye.com/home", "www.companyf.com/home", "www.phishingsite.com/home"))

bad_site_list <- data.frame("host" = c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com"))

I hope to achieve this result:

host                                  URL
www.companya.com               www.companya.com/home
www.companyb.com               www.companyb.com/home
www.badwebsite.com             www.badwebsite.com/home
www.companyc.com               www.companyc.com/home
www.companyd.com               www.companyd.com/home
www.badwebsite.com             www.badwebsite.com/home
www.companye.com               www.companye.com/home
www.companyf.com               www.companyf.com/home
www.badwebsite.com             www.badwebsite.com/home

3 Answers

1

Without regex you could do something like this:

# Convert factor columns to character
site_list[] <- lapply(site_list, as.character)
bad_site_list[] <- lapply(bad_site_list, as.character)

# If you want to replace all the bad sites with "www.badwebsite.com" you can:
site_list$URL[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com/home"
site_list$host[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com"

site_list
                host                     URL
1   www.companya.com   www.companya.com/home
2   www.companyb.com   www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4   www.companyc.com   www.companyc.com/home
5   www.companyd.com   www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7   www.companye.com   www.companye.com/home
8   www.companyf.com   www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
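
The replacement above hardcodes "/home" for every bad URL, which works for the sample data because every URL ends in "/home". If the real URLs have different paths, a minimal sketch along these lines keeps each path and swaps only the host. The paths "/login" and "/a/b" are made up for illustration, and the code assumes every URL starts with its own host, as in the question's data.

```r
# Small variant of the question's data with differing paths (hypothetical values)
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.virussite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/login", "www.virussite.com/a/b"),
  stringsAsFactors = FALSE
)
bad_site_list <- data.frame(
  host = c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com"),
  stringsAsFactors = FALSE
)

# Exact-match lookup: TRUE for rows whose host is on the bad list
is_bad <- site_list$host %in% bad_site_list$host

# Drop the old host from the front of each bad URL and prepend the new one.
# substring() keeps everything after the first nchar(host) characters.
site_list$URL[is_bad] <- paste0(
  "www.badwebsite.com",
  substring(site_list$URL[is_bad], nchar(site_list$host[is_bad]) + 1)
)
site_list$host[is_bad] <- "www.badwebsite.com"

site_list
```

Computing is_bad once, before either column is modified, also avoids depending on the order of the two assignments.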

Using regex you could do something like this:

# Build one alternation pattern: host1|host2|host3 (note that "." is a regex wildcard here)
bad_site_pattern <- paste(bad_site_list$host, collapse = "|")

# Then replace all instances in the dataframe using lapply
site_list[] <- lapply(site_list, gsub, pattern = bad_site_pattern, replacement = "www.badwebsite.com")

site_list
                host                     URL
1   www.companya.com   www.companya.com/home
2   www.companyb.com   www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4   www.companyc.com   www.companyc.com/home
5   www.companyd.com   www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7   www.companye.com   www.companye.com/home
8   www.companyf.com   www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
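
One caveat with the pattern built this way: the dots in the host names are regex wildcards, so the pattern can match strings that merely resemble a bad host. A small sketch of escaping the dots first follows. The look-alike string "wwwXvirussite.com" is invented for illustration, and only dots are escaped here, on the assumption that host names contain no other regex metacharacters.

```r
bad_hosts <- c("www.malwaresite.com", "www.virussite.com")

# Unescaped, "." matches any character, so a look-alike string also matches
naive_pattern <- paste(bad_hosts, collapse = "|")
naive_hit <- grepl(naive_pattern, "wwwXvirussite.com")
naive_hit   # TRUE: a false positive

# Escape each literal dot: the regex "\\." matches a dot,
# and the replacement "\\\\." writes a backslash followed by a dot
escaped_hosts <- gsub("\\.", "\\\\.", bad_hosts)
safe_pattern  <- paste(escaped_hosts, collapse = "|")

safe_miss <- grepl(safe_pattern, "wwwXvirussite.com")
safe_hit  <- grepl(safe_pattern, "www.virussite.com/home")
safe_miss   # FALSE
safe_hit    # TRUE
```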

7 Comments

what does the "|" do in the pattern?
| is an operator for "or". So the paste(bad_site_list$host, collapse = "|") takes your vector of bad sites, pastes them into one string, and puts an "or" between each one. When searching for the pattern, it is searching for site1 OR site2 OR site3. Does that clarify things?
I see, but I received this error message when I ran it: Error in FUN(X[[i]], ...) : assertion 'tree->num_tags == num_tags' failed in executing regexp. Does it mean I have a problem with my website list?
I've never seen that error. Looks like it can be because the pattern is too long, possibly. Check out this similar issue and their solution: stackoverflow.com/questions/28684438/…. What is the length of your bad site list? I.e., length(bad_site_list$host)
that could be the reason too, it's a little over 1100 rows
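
If the single combined pattern is indeed too long, one possible workaround (a sketch, not a confirmed fix for that exact error) is to skip the big alternation and replace one host at a time with fixed = TRUE, which treats each host as a literal string rather than a regex. The helper name replace_bad_hosts is hypothetical, and the data is a shortened version of the question's example.

```r
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.phishingsite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/home", "www.phishingsite.com/home"),
  stringsAsFactors = FALSE
)
bad_hosts <- c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com")

# Hypothetical helper: replace each bad host in turn using literal matching,
# so no single large regex is ever compiled
replace_bad_hosts <- function(x, bad_hosts, replacement = "www.badwebsite.com") {
  for (h in as.character(bad_hosts)) {
    x <- gsub(h, replacement, x, fixed = TRUE)
  }
  x
}

# Apply the helper to every column, keeping the data frame structure
site_list[] <- lapply(site_list, replace_bad_hosts, bad_hosts = bad_hosts)

site_list
```

With about 1100 hosts this runs 1100 literal replacements per column, which is slower than a single pass but avoids the pattern-length problem. The %in% approach from the answer above also sidesteps regex entirely when exact host matches are enough.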
1

I would do it the following way for your simple example, though it might not be optimal for more complex tables:

apply(site_list, 2, function(x) gsub(paste(bad_site_list$host, collapse = "|"), "www.badwebsite.com", x))

In apply, "2" means the function is applied to each column ("1" would apply it to each row).
The function looks for all the hosts in bad_site_list and replaces them with www.badwebsite.com (using gsub). Note that apply returns a character matrix rather than a data frame, so assign the result if you want to keep it.
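
Because apply works column by column, both host and URL are handled in one call. A short sketch of storing the result back as a data frame, using a shortened copy of the question's data (the variable name cleaned is just for illustration):

```r
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.virussite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/home", "www.virussite.com/home"),
  stringsAsFactors = FALSE
)
bad_hosts <- c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com")
bad_site_pattern <- paste(bad_hosts, collapse = "|")

# apply() coerces the data frame to a character matrix and returns a matrix
cleaned <- apply(site_list, 2, function(x) gsub(bad_site_pattern, "www.badwebsite.com", x))
is.matrix(cleaned)

# Convert back so the result can be used like the original data frame
site_list <- as.data.frame(cleaned, stringsAsFactors = FALSE)
site_list
```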

1 Comment

Is it possible to replace for both columns?
0

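Load the stringr package with library(stringr). Its functions work on character vectors, so pass a column of the data frame rather than the whole data frame.

Search for a string in a vector:

str_detect(dataframe_name$column_name, "string_you_are_searching_for")

Replace a string in a vector (only the first match in each element; use str_replace_all for every match):

str_replace(dataframe_name$column_name, "old_string", "new_string")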
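
Applied to the question's data, that might look like the sketch below. It assumes the stringr package is installed, and it wraps the host in fixed() so that the dots are matched literally instead of as regex wildcards. The two-element urls vector is a cut-down stand-in for site_list$URL.

```r
library(stringr)

urls <- c("www.companya.com/home", "www.virussite.com/home")

# Which elements contain the bad host? fixed() turns off regex interpretation
has_bad <- str_detect(urls, fixed("www.virussite.com"))
has_bad   # FALSE TRUE

# Replace the bad host wherever it is found
cleaned_urls <- str_replace(urls, fixed("www.virussite.com"), "www.badwebsite.com")
cleaned_urls   # "www.companya.com/home" "www.badwebsite.com/home"
```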
