
Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.

Individuals are separated by a period (.), and each individual has 10 attributes.

txt = "EREKSON(Andrew,Hélène),female10/06/2011@Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013@Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980@Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."

I'd like to parse this into a dataframe with one row per individual and 10 columns, one for each variable.

I've tried using regex and looked at other text extraction solutions on Stack Overflow, but haven't been able to get the output I want.

This is the final dataframe I have in mind, based on the character string input -

result = data.frame(first_names = c('Andrew Hélène','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Hajil Pau Joëli'),
                    family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
                    gender = c('female','male','female','female'),
                    birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
                    birth_city = c('Geneva','London','Paris','Berlin'),
                    birth_country = c('Switzerland','England','France','Germany'),
                    acc_type = c('PPF','PPF','PPF','VAT'),
                    acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
                    district = c('dist.093','dist.097','dist.088','dist.078'),
                    code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))

Any help would be much appreciated

  • I guess you can go from the following: library(tidyverse) txt %>% str_split("(?<=\\d)\\.(?=[A-Z])") %>% enframe %>% unnest(everything()) %>% mutate(value = str_split(value, "\\),")) %>% unnest_wider(value)... Commented Mar 19, 2022 at 15:00
  • Doesn't seem to be splitting it the right way Commented Mar 19, 2022 at 15:07
  • Well, that is just a start: You have to put some more work on it. Commented Mar 19, 2022 at 15:08

2 Answers


Here's a tidy solution with tidyr's functions separate_rows and extract:

library(tidyr)
data.frame(txt) %>%
  # separate `txt` into rows using the dot `.` *if* 
  # preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
  separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
  extract(
          # select column from which to extract:
          txt,
          # define column names into which to extract:
          into = c("family_name","first_names","gender",
                   "birthday","birth_city","birth_country",
                   "acc_type","acc_num","district","code"),
          # describe the string exhaustively using capturing groups
          # `(...)` to delimit what's to be extracted:
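          # first names: `[^)]+` matches everything up to the closing
          # parenthesis, so hyphens and apostrophes (e.g. "El-Hadi") are allowed:
          regex = "([A-Z]+)\\(([^)]+)\\),([a-z]+)([\\d/]+)@(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),dist\\.(\\d+),Dt\\.([\\d/]+)")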
# Output from an earlier version of the sample data (the names in `txt` were edited later):
# A tibble: 4 × 10
  family_name first_names    gender birthday   birth_city birth_country acc_type acc_num  
  <chr>       <chr>          <chr>  <chr>      <chr>      <chr>         <chr>    <chr>    
1 EREKSON     Andrew,Peter   male   10/06/2011 Geneva     Switzerland   PPF      2000X007…
2 OBAMA       Barack,Hussian male   04/12/1956 London     England       PPF      2001X005…
3 CLINTON     Hillary        female 25/06/2013 Paris      France        PPF      2009X005…
4 GATES       Melinda        female 03/03/1980 Berlin     Germany       VAT      2010X006…
# … with 2 more variables: district <chr>, code <chr>
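
For comparison, here is a minimal base-R sketch of the same split-then-capture idea. It is only an illustration under a few assumptions: the two-record `txt`, the `pattern`, and the helper names `records`, `cols` and `proto` are made up for this example, every record is assumed to end in a `Dt.ddd/ddd` code followed by a period, and the `dist.` and `Dt.` prefixes are kept because the desired output in the question includes them. It relies on `utils::strcapture()`, which needs R 3.4.0 or later.

```r
# Shortened two-record sample in the same format as `txt` in the question.
txt <- paste0(
  "EREKSON(Andrew,Hélène),female10/06/2011@Geneva(Switzerland),",
  "PPF,2000X007707,dist.093,Dt.043/996.",
  "BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),",
  "PPF,2001X005729,dist.097,Dt.043/997."
)

# Turn the "." that ends each record into a newline, then split on it.
records <- strsplit(
  gsub("(Dt\\.[0-9]{3}/[0-9]{3})\\.", "\\1\n", txt),
  "\n", fixed = TRUE
)[[1]]

# One capture group per column; the "dist." and "Dt." prefixes are kept.
pattern <- paste0(
  "^([^(]+)\\(([^)]+)\\),",          # family_name(first_names),
  "(female|male)([0-9/]+)",          # gender immediately followed by birthday
  "@([^(]+)\\(([^)]+)\\),",          # @birth_city(birth_country),
  "([^,]+),([^,]+),",                # acc_type,acc_num,
  "(dist\\.[0-9]+),(Dt\\.[0-9/]+)$"  # district,code
)

cols <- c("family_name", "first_names", "gender", "birthday", "birth_city",
          "birth_country", "acc_type", "acc_num", "district", "code")

# Zero-row prototype that tells strcapture() the column names and types.
proto <- as.data.frame(
  setNames(replicate(length(cols), character(), simplify = FALSE), cols),
  stringsAsFactors = FALSE
)

result <- utils::strcapture(pattern, records, proto)

# Separate multiple first names with spaces instead of commas.
result$first_names <- gsub(",", " ", result$first_names, fixed = TRUE)
result
```

Records that do not match `pattern` come back as a row of `NA`s, which makes malformed entries easy to spot with `result[!complete.cases(result), ]`.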

1 Comment

any person who helps any other person with regex deserves an applause! Congrats ~Chris

Here is a solution using the tidyverse which pipes together different stringr functions to clean the string, before having readr read it, basically as a CSV:

library(dplyr, warn.conflicts = FALSE) # for pipes

df <- 
  txt %>% 
  
  # Replace "." sep with newline
  stringr::str_replace_all(
    "\\.[A-Z]", 
    function(x) stringr::str_replace(x, "\\.", "\n")
  ) %>% 
  
  # Replace all commas in (First[,Middle1,Middle2,...]) with space
  stringr::str_replace_all(
    # Match anything inside brackets, but as few times as possible, so we don't
    # match multiple brackets
    "\\(.*?\\)", 
    # Inside the regex that was matched, replace comma with space
    function(x) stringr::str_replace_all(x, ",", " ")
  ) %>% 
  
  # Replace ( with ,
  stringr::str_replace_all("\\(", ",") %>%
  
  # Remove )
  stringr::str_remove_all("\\)") %>%
  
  # Replace @ with ,
  stringr::str_replace_all("@", ",") %>%
  
  # Replace the trailing "." with a newline
  stringr::str_replace_all("\\.$", "\n") %>% 
  
  # Add , after female/male
  stringr::str_replace_all("male", "male,") %>% 
  
  # Read as comma delimited file (works since string contains \n)
  readr::read_delim(
    file = .,
    delim = ",",
    col_names = FALSE,
    show_col_types = FALSE
  )

# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
  "family_name",
  "first_names",
  "gender",
  "birthday",
  "birth_city",
  "birth_country",
  "acc_type",
  "acc_num",
  "district",
  "code"
)

df
#> # A tibble: 4 × 10
#>   family_name first_names      gender birthday birth_city birth_country acc_type
#>   <chr>       <chr>            <chr>  <chr>    <chr>      <chr>         <chr>   
#> 1 EREKSON     Andrew Hélène    female 10/06/2… Geneva     Switzerland   PPF     
#> 2 BOUKAR      Mohamed El-Hadi  male   04/12/1… London     England       PPF     
#> 3 HARIMA      Olak N’nassik G… female 25/06/2… Paris      France        PPF     
#> 4 THOMAS      Hajil Pau Joëli  female 03/03/1… Berlin     Germany       VAT     
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>

Created on 2022-03-20 by the reprex package (v2.0.1)

Note that there are probably more efficient regexes one could use, but I believe this approach is simpler and easier to change later.
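
As mentioned in the code comment, the column names can also be passed straight to `read_delim()`. The sketch below is one way that might look, not a definitive version: `cleaned` is a hypothetical two-record stand-in for the string produced by the stringr steps, `I()` (readr 2.0 or later) marks it explicitly as literal data, and forcing every column to character is my own assumption so that values such as `acc_num` are never guessed as another type.

```r
library(readr)

# Hypothetical already-cleaned string: one record per line, comma-delimited,
# i.e. the shape the stringr steps above are meant to produce.
cleaned <- paste0(
  "EREKSON,Andrew Hélène,female,10/06/2011,Geneva,Switzerland,",
  "PPF,2000X007707,dist.093,Dt.043/996\n",
  "BOUKAR,Mohamed El-Hadi,male,04/12/1956,London,England,",
  "PPF,2001X005729,dist.097,Dt.043/997\n"
)

col_names <- c("family_name", "first_names", "gender", "birthday", "birth_city",
               "birth_country", "acc_type", "acc_num", "district", "code")

df <- read_delim(
  I(cleaned),                                   # literal data, not a file path
  delim = ",",
  col_names = col_names,                        # name the columns while reading
  col_types = cols(.default = col_character())  # keep every column as text
)

# Optional: convert the dd/mm/yyyy strings to Date values.
df$birthday <- as.Date(df$birthday, format = "%d/%m/%Y")
df
```

Naming the columns at read time removes the separate `names(df) <- ...` step and keeps the column order next to the parsing call.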

10 Comments

Thank you! I noticed that the splitting of first names isn't right when the first names contain special characters like 'è', '’', '-', 'ï', ''' and so on.
I have slightly changed the input 'txt' to consider a case when an individual has 3 first names (in some cases even 4 first names exist in the original dataframe). How does the code adapt to take multiple first names into account?
Not much would have to be changed, just (1) the regex that locates the first/middle names inside the parentheses from only accepting English letters to accepting all words, and (2) the str_replace() with str_replace_all() to allow for several middle names. I have updated my answer with code that works with your new data.
This looks great, now multiple first names with special characters seem to be working well! I noticed just a few cases where it still doesn't work. I've updated the question to reflect them. Any idea why these aren't working but the rest are?
Understood, I need to brush up on my regex.. Thanks so much! Accepting your answer.