I have a string containing the starting lineup (extracted from the web) for a rugby game. It looks like this:

 "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"

What I want is essentially a table with two columns, one being the player's number, and the other being the player's name. e.g.

position     name
1            Joe Moody
2            Codie Taylor
3            Owen Franks
4            Scott Barrett
...          ...

For all players.

I've tried using strsplit to split on ",", but then the first player keeps the team name:

"Crusaders: 15 David Havili"

and players 1 and 16 end up merged into one element:

"1 Joe MoodyReplacements: 16 Sam Anderson-Heather".

Any ideas?

  • The format of your string is not consistent; for example, in almost all cases a "," (comma) is used as separator except for the part "1 Joe Moody: 16 Sam Anderson-Heather" where a ":" (colon) is the separator. Is that a typo? What do you expect to happen to the replacement players? Are they to be included in the output table? Commented Mar 11, 2019 at 4:08
  • Reimport your data and make sure you keep the newline character. Commented Mar 11, 2019 at 4:11

2 Answers

I agree with @HongOoi's comment; it's best to take a step back and ensure that data is imported in a more sensible way. That said, here is a post-hoc hacky solution. Not sure how well this generalises, if at all.

ss <-  "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"


library(tidyverse)
data.frame(ss = ss) %>%
    mutate(ss = str_replace(ss, "Replacements", "")) %>%   # Remove "Replacements"
    mutate(ss = str_split(ss, "(,|:) ")) %>%               # Split on "," or ":"
    unnest(ss) %>%                                         # Expand the list column into one row per piece
    separate(ss, c("position", "name"), sep = "(?<=\\d)\\s", fill = "right") %>%
    filter(!is.na(name))                                   # Remove the first "Crusaders" line
#   position                  name
#1        15          David Havili
#2        14       Seta Tamanivalu
#3        13          Jack Goodhue
#4        12           Ryan Crotty
#5        11         George Bridge
#6        10        Richie Mo’unga
#7         9             Bryn Hall
#8         8           Kieran Read
#9         7             Matt Todd
#10        6 Heiden Bedwell-Curtis
#11        5     Sam Whitelock (c)
#12        4         Scott Barrett
#13        3           Owen Franks
#14        2          Codie Taylor
#15        1             Joe Moody
#16       16  Sam Anderson-Heather
#17       17             Tim Perry
#18       18     Michael Alaalatoa
#19       19           Luke Romano
#20       20             Pete Samu
#21       21     Mitchell Drummond
#22       22         Mitchell Hunt
#23       23         Braydon Ennor
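
For comparison, the same steps (drop the "Replacements" label, split on ", " or ": ", then peel the leading number off each piece) can be sketched in base R. This is only a sketch: it uses a shortened sample of the lineup string, and the variable names parts and lineup are illustrative. It also sorts by position, as in the table the question asks for.

```r
# Shortened sample of the lineup string from the question
ss <- "Crusaders: 15 David Havili, 5 Sam Whitelock (c), 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 23 Braydon Ennor"

# Drop the "Replacements" label so that ": " is left as a separator,
# then split on ", " or ": "
parts <- strsplit(sub("Replacements", "", ss, fixed = TRUE), "[,:] ")[[1]]

# Keep only the pieces that start with a jersey number (drops "Crusaders")
parts <- parts[grepl("^[0-9]", parts)]

lineup <- data.frame(
  position = as.integer(sub(" .*$", "", parts)),  # everything before the first space
  name     = sub("^[0-9]+ ", "", parts),          # everything after the leading number
  stringsAsFactors = FALSE
)

# Sort by jersey number
lineup <- lineup[order(lineup$position), ]
```

Like the pipeline above, this relies on every player being separated by ", " and would need adjusting for input that breaks that pattern.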

Using stringr::str_match_all() and a regular expression, you can find and extract all matches. Take care to use the non-greedy quantifier (*?) and to match the end of the string where no comma follows the name:

library(dplyr)
library(stringr)
ea <- "Crusaders: 15 David Havili, 14 Seta Tamanivalu, 13 Jack Goodhue, 12 Ryan Crotty, 11 George Bridge, 10 Richie Mo’unga, 9 Bryn Hall, 8 Kieran Read, 7 Matt Todd, 6 Heiden Bedwell-Curtis, 5 Sam Whitelock (c), 4 Scott Barrett, 3 Owen Franks, 2 Codie Taylor, 1 Joe MoodyReplacements: 16 Sam Anderson-Heather, 17 Tim Perry, 18 Michael Alaalatoa, 19 Luke Romano, 20 Pete Samu, 21 Mitchell Drummond, 22 Mitchell Hunt, 23 Braydon Ennor"
ea <- unlist(strsplit(ea, "Replacements: "))

tibble(jersey = str_match_all(ea, "\\d+") %>% unlist(),
       # each name ends at a comma or at the end of the string
       player = str_match_all(ea, "(?<=\\d\\s).*?(?=$|,)") %>% unlist())

# A tibble: 23 x 2
   jersey player               
   <chr>  <chr>                
 1 15     David Havili         
 2 14     Seta Tamanivalu      
 3 13     Jack Goodhue         
 4 12     Ryan Crotty          
 5 11     George Bridge  

3 Comments

didn't notice the "Replacements" but it works now for all
Hi there @Elio, thanks so much for the answer, this has really helped. A few questions: (1) It seems that it doesn't read the last letter of the final name, any ideas? (2) It only works by reading up to a comma (and I see that's how you've coded it). However, one game has a specific error where the comma after a player's name is missing: "17 Jacques Van Rooyen, 18 Jacobie Adriaanse 19 Lourens Erasmus, 20 Marvin Orie,". There's no comma after Jacobie Adriaanse. Any ideas what to do here?
Yes, you're right, it's because of the lookahead; change it to "(?<=\\d\\s).*?(?=$|,)" (take the dot out of the lookahead).
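
For the missing-comma case raised in the comment above, one possible approach is to stop each name at the next jersey number rather than at a comma. The following is a rough sketch, not a tested general solution: parse_lineup is a hypothetical helper name, and it assumes that player names never contain digits and that the only label embedded in the string is "Replacements:".

```r
library(stringr)

# Hypothetical helper: returns one row per player found in the string x
parse_lineup <- function(x) {
  # Capture a jersey number, then lazily capture the name up to whichever
  # comes first: the next jersey number, the "Replacements:" label, or the
  # end of the string. An optional comma and surrounding spaces are consumed.
  m <- str_match_all(
    x,
    "(\\d+)\\s+(.+?)\\s*,?\\s*(?=\\d+\\s|Replacements:|$)"
  )[[1]]
  data.frame(
    position = as.integer(m[, 2]),  # first capture group: jersey number
    name     = m[, 3],              # second capture group: player name
    stringsAsFactors = FALSE
  )
}

# The fragment with the missing comma from the comment
parse_lineup("17 Jacques Van Rooyen, 18 Jacobie Adriaanse 19 Lourens Erasmus, 20 Marvin Orie,")
```

Because the number and the name are captured in a single pass, the two columns cannot drift out of step, and the string does not need to be split on "Replacements: " first.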
