
I have trimmed down an html file to get each character vector of a data set to look like:

"<h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>                                             <li>Assistant Teaching Professor</li>"

I want to use regular expressions to trim it down to just the name and position (for clarification, each element has a different name and position). What I used before won't work for this (I used the grepl function to subset my original HTML file). How would I go about trimming this with regular expressions, or even another technique? Thanks for any help in advance.

Or if it's easier to work with, I have two other character vectors separating the two that look like:

"                                        <h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>"

and

"                                            <li>Assistant Teaching Professor</li>"
  • Dupe of Parsing HTML file in R? Or R Read & Parse HTML to List. There are better ways than regex when parsing HTML. Commented Apr 4, 2018 at 23:15
  • @WiktorStribiżew that's probably more practical. Unfortunately, I'm trying to learn regex for a class I'm in, and I'm trying to avoid cutting corners with shortcuts like that. Thanks for the advice though! Commented Apr 4, 2018 at 23:26

2 Answers


You can use sub and match everything but what you want. So I'd probably do something like

test = '<h3 class="personName">Whitney  Alicia Zimmerman</h3>  '
sub("<.*", "", sub(".*\">", "", test))

[1] "Whitney  Alicia Zimmerman"

That sub expression can be modified to get rid of whatever you want. The trick is to match the stuff you don't want and substitute in the empty string.

The basic structure of sub is sub(pattern, replacement, x). Looking at the documentation (?sub) will clear it up further. I've nested my subs so I can remove the start and the end of the string.

EDIT: I incorporated @Onyambu's suggestion, as he is completely right: only sub is required, not gsub as I originally suggested. The difference is that gsub replaces all matches, while sub replaces only the first.
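A quick illustration of that difference, using a throwaway string (not from the question):

```r
x <- "a1b2c3"
sub("[0-9]", "", x)   # replaces only the first digit: "ab2c3"
gsub("[0-9]", "", x)  # replaces every digit: "abc"
```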

Below he also provides a solution using just one sub rather than two like I have.


3 Comments

gsub seems like a relatively simple way to get rid of all of the junk around it. I'll work with that more and see what else I can do with it. Thank you
What you need is sub(".*>(.*)<.*", "\\1", test). You do not need the g variant (gsub), but it will still give the same result: gsub(".*>(.*)<.*", "\\1", test)
Good one! I have a bad habit of using gsub all the time because most of my matches are singular. I'm not a regex expert; can you explain to me how the \\1 works? Does it match the whole string, then sub in its place the 1st (as in 0th, 1st, 2nd) matched token, or something like that?
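For reference, the parentheses in the pattern form a capture group, and \\1 in the replacement stands for whatever the first group matched (groups are numbered 1, 2, … by their opening parenthesis). A small sketch on the question's string:

```r
test <- '<h3 class="personName">Whitney  Alicia Zimmerman</h3>'
# ".*>" greedily eats through the opening tag, "(.*)" captures the name,
# and "<.*" eats the closing tag; "\\1" substitutes the captured name back in.
sub(".*>(.*)<.*", "\\1", test)
# [1] "Whitney  Alicia Zimmerman"
```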

If you really want to use regex, here's a solution that uses stringr (as well as magrittr):

Using your long string:

htmlstring <- c("<h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>                                             <li>Assistant Teaching Professor</li>")  

The code:

library(stringr)
library(magrittr)

ParsedString <- str_replace_all(htmlstring, "<[^>]+>", "") %>% # remove everything between angle brackets, inclusive
                str_squish() # remove all extraneous whitespace

Output:

> ParsedString
[1] "Whitney Alicia Zimmerman Assistant Teaching Professor"
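If you want the name and position as separate fields rather than one squished string, a possible extension (my sketch, not part of the answer above) is to give each tag's contents its own capture group with str_match:

```r
library(stringr)

htmlstring <- "<h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>                                             <li>Assistant Teaching Professor</li>"

# Column 1 is the full match; columns 2 and 3 are the capture groups
m <- str_match(htmlstring, "<h3[^>]*>([^<]+)</h3>\\s*<li>([^<]+)</li>")
name     <- str_squish(m[, 2])  # "Whitney Alicia Zimmerman"
position <- str_squish(m[, 3])  # "Assistant Teaching Professor"
```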

2 Comments

I don't reallllly want to use regex, but I do need some foundation in the field. This helps me understand how to use them.
Glad to help! For a gsub solution, try ParsedString <- gsub("<[^>]+>", "", htmlstring) %>% gsub("\\s{2,}", " ", .)
