
I have trimmed down an html file to get each character vector of a data set to look like:

"<h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>                                             <li>Assistant Teaching Professor</li>"

I want to use regular expressions to trim it down to just the name and position (for clarification, each element has a different name and position). What I used before won't work for this (I used the grepl function to subset my original HTML file). How would I go about trimming this with regular expressions, or even another technique? Thanks for any help in advance.

Or if it's easier to work with, I have two other character vectors separating the two that look like:

"                                        <h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>"

and

"                                            <li>Assistant Teaching Professor</li>"
  • Dupe of Parsing HTML file in R? Or R Read & Parse HTML to List. There are better ways than regex when parsing HTML. Commented Apr 4, 2018 at 23:15
  • @WiktorStribiżew that's probably more practical. Unfortunately, I'm trying to learn regex for a class I'm in, and I'm trying to avoid cutting corners with shortcuts like that. Thanks for the advice though! Commented Apr 4, 2018 at 23:26

2 Answers


You can use sub and match everything but what you want. So I'd probably do something like

test = '<h3 class="personName">Whitney  Alicia Zimmerman</h3>  '
sub("<.*", "", sub(".*\">", "", test))

[1] "Whitney  Alicia Zimmerman"

That sub expression can be modified to get rid of whatever you want. The trick is to match the stuff you don't want and substitute in the empty string.

The basic structure of sub is sub(pattern, replacement, x). Looking at the documentation (?sub) will clear it up further. I've nested my subs so I can remove the start and the end of the string.

EDIT: I incorporated @Onyambu's suggestion, as he is completely right: only sub is required, not gsub as I originally suggested. The difference is that gsub replaces all matches, while sub replaces only the first.
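A quick illustration of that difference, using a throwaway string (not from the question):

```r
x <- "a1b2c3"
sub("[0-9]", "", x)   # replaces only the first digit: "ab2c3"
gsub("[0-9]", "", x)  # replaces every digit: "abc"
```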

Below he also provides a solution using just one sub rather than two like I have.


3 Comments

gsub seems like a relatively simple way to get rid of all of the junk around it. I'll work with that more and see what else I can do with it. Thank you
What you need is sub(".*>(.*)<.*", "\\1", test). You do not need the g variant (gsub), but it will still give the same result: gsub(".*>(.*)<.*", "\\1", test)
Good one! I have a bad habit of using gsub all the time because most of my matches are singular. I'm not a regex expert; can you explain to me how the \\1 works? Does it match the whole string, then sub in its place the 1st (as in 0th, 1st, 2nd) matched token, or something like that?
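For reference, the parentheses in the pattern form a capture group, and \\1 in the replacement stands for whatever the first group matched (groups are numbered 1, 2, … by their opening parenthesis). A small sketch on the question's string:

```r
test <- '<h3 class="personName">Whitney  Alicia Zimmerman</h3>'
# ".*>" greedily eats through the opening tag, "(.*)" captures the name,
# and "<.*" eats the closing tag; "\\1" substitutes the captured name back in.
sub(".*>(.*)<.*", "\\1", test)
# [1] "Whitney  Alicia Zimmerman"
```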

If you really want to use regex, here's a solution that uses stringr (as well as magrittr):

Using your long string:

htmlstring <- c("<h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>                                             <li>Assistant Teaching Professor</li>")  

The code:

library(stringr)
library(magrittr)

ParsedString <- str_replace_all(htmlstring, "<[^>]+>", "") %>% # remove everything between angle brackets, inclusive
                str_squish() # remove all extraneous whitespace

Output:

> ParsedString
[1] "Whitney Alicia Zimmerman Assistant Teaching Professor"
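If you want the name and position as separate fields rather than one squished string, a possible extension (my sketch, not part of the answer above) is to give each tag's contents its own capture group with str_match:

```r
library(stringr)

htmlstring <- "<h3 class=\"personName\">Whitney  Alicia Zimmerman</h3>                                             <li>Assistant Teaching Professor</li>"

# Column 1 is the full match; columns 2 and 3 are the capture groups
m <- str_match(htmlstring, "<h3[^>]*>([^<]+)</h3>\\s*<li>([^<]+)</li>")
name     <- str_squish(m[, 2])  # "Whitney Alicia Zimmerman"
position <- str_squish(m[, 3])  # "Assistant Teaching Professor"
```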

2 Comments

I don't reallllly want to use regex, but I do need some foundation in the field. This helps me understand how to use them.
Glad to help! For a gsub solution, try ParsedString <- gsub("<[^>]+>", "", htmlstring) %>% gsub("\\s{2,}", " ", .)
