R convert character string to a dataframe

Question

Here is a small sample of a larger character string that I have (no whitespaces). It contains fictional details of individuals.

Individuals are separated by a period (.), and each individual has 10 attributes.

txt = "EREKSON(Andrew,Hélène),female10/06/2011@Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013@Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980@Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."

I'd like to parse this into a dataframe with one observation per individual and 10 columns, one for each variable.

I've tried using regex and looking at other text extraction solutions on stackoverflow, but haven't been able to reach the output I want.

This is the final dataframe I have in mind, based on the character string input:

result = data.frame(first_names = c('Andrew Hélène','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Hajil Pau Joëli'),
                    family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
                    gender = c('female','male','female','female'),
                    birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
                    birth_city = c('Geneva','London','Paris','Berlin'),
                    birth_country = c('Switzerland','England','France','Germany'),
                    acc_type = c('PPF','PPF','PPF','VAT'),
                    acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
                    district = c('dist.093','dist.097','dist.088','dist.078'),
                    code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))

Any help would be much appreciated

I guess you can start from the following: library(tidyverse); txt %>% str_split("(?<=\\d)\\.(?=[A-Z])") %>% enframe %>% unnest(everything()) %>% mutate(value = str_split(value, "\\),")) %>% unnest_wider(value)...
– PaulS, Commented Mar 19, 2022 at 15:00
Well, that is just a start: you have to put some more work into it.
– PaulS, Commented Mar 19, 2022 at 15:08
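
The splitting idea in the comment above can be tried on its own before any columns are extracted. The following is only a minimal base R sketch, assuming a shortened two-record copy of the input (`txt_demo` and `records` are names made up for this illustration): it splits on a dot that follows a digit and precedes an uppercase letter, then drops the final trailing dot.

```r
# Shortened copy of the question's input (first two individuals only).
txt_demo <- "EREKSON(Andrew,Hélène),female10/06/2011@Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997."

# Split only on a "." that is preceded by a digit and followed by an
# uppercase letter, so the dots in "dist.093" and "Dt.043" are left alone.
records <- strsplit(txt_demo, "(?<=\\d)\\.(?=[A-Z])", perl = TRUE)[[1]]

# The last record still carries the closing "."; remove it.
records <- sub("\\.$", "", records)

length(records)
```

Each element of `records` then holds exactly one individual, which is the shape the later extraction steps need.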

Chris Ruehlemann · Answer · 2022-03-19 15:58:20Z

3

Here's a tidy solution with tidyr's functions separate_rows and extract:

library(tidyr)
data.frame(txt) %>%
  # separate `txt` into rows using the dot `.` *if* 
  # preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
  separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
  extract(
          # select column from which to extract:
          txt,
          # define column names into which to extract:
          into = c("family_name","first_names","gender",
                   "birthday","birth_city","birth_country",
                   "acc_type","acc_num","district","code"),
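          # describe the string exhaustively using capturing groups
          # `(...)` to delimit what's to be extracted; `[^)]+` takes
          # everything up to the closing parenthesis, so first names with
          # accents, hyphens or apostrophes are captured too:
          regex = "([A-Z]+)\\(([^)]+)\\),([a-z]+)([\\d/]+)@(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),(dist\\.\\d+),(Dt\\.[\\d/]+)")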
# Output produced with an earlier version of the sample data:
# A tibble: 4 × 10
  family_name first_names    gender birthday   birth_city birth_country acc_type acc_num  
  <chr>       <chr>          <chr>  <chr>      <chr>      <chr>         <chr>    <chr>    
1 EREKSON     Andrew,Peter   male   10/06/2011 Geneva     Switzerland   PPF      2000X007…
2 OBAMA       Barack,Hussian male   04/12/1956 London     England       PPF      2001X005…
3 CLINTON     Hillary        female 25/06/2013 Paris      France        PPF      2009X005…
4 GATES       Melinda        female 03/03/1980 Berlin     Germany       VAT      2010X006…
# … with 2 more variables: district <chr>, code <chr>
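
To see why the character class used for the first names matters once the names contain hyphens or apostrophes, here is a small base R sketch on a single record from the question. The variable names (`rec`, `strict`, `loose`) are made up for this illustration, and only the first two capture groups are shown.

```r
# One record taken from the question's input.
rec <- "BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997"

# `[\\w,]+` accepts only word characters and commas, so it stops at "-".
strict <- "^([A-Z]+)\\(([\\w,]+)\\)"
# `[^)]+` accepts anything up to the closing parenthesis.
loose  <- "^([A-Z]+)\\(([^)]+)\\)"

strict_hit <- regmatches(rec, regexec(strict, rec, perl = TRUE))[[1]]
loose_hit  <- regmatches(rec, regexec(loose,  rec, perl = TRUE))[[1]]

length(strict_hit)                      # 0: the strict pattern does not match
loose_hit[2]                            # "BOUKAR"
first_names <- gsub(",", " ", loose_hit[3])
first_names                             # "Mohamed El-Hadi"
```

The same negated class can be dropped into a longer pattern wherever a parenthesised field may contain characters outside `\w`.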

edited Mar 19, 2022 at 15:58

answered Mar 19, 2022 at 15:53

Chris Ruehlemann


1 Comment

LDT Over a year ago

any person who helps any other person with regex deserves an applause! Congrats ~Chris

jpiversen · Accepted Answer · 2022-03-20 15:00:40Z

1

Here is a solution using the tidyverse which pipes together different stringr functions to clean the string, before having readr read it, basically as a CSV:

library(dplyr, warn.conflicts = FALSE) # for pipes

df <- 
  txt %>% 
  
  # Replace "." sep with newline
  stringr::str_replace_all(
    "\\.[A-Z]", 
    function(x) stringr::str_replace(x, "\\.", "\n")
  ) %>% 
  
  # Replace all commas in (First[,Middle1,Middle2,...]) with space
  stringr::str_replace_all(
    # Match anything inside brackets, but as few times as possible, so we don't
    # match multiple brackets
    "\\(.*?\\)", 
    # Inside the regex that was matched, replace comma with space
    function(x) stringr::str_replace_all(x, ",", " ")
  ) %>% 
  
  # Replace ( with ,
  stringr::str_replace_all("\\(", ",") %>%
  
  # Remove )
  stringr::str_remove_all("\\)") %>%
  
  # Replace @ with ,
  stringr::str_replace_all("@", ",") %>%
  
  # Replace the trailing "." with a newline
  stringr::str_replace_all("\\.$", "\n") %>% 
  
  # Add , after female/male
  stringr::str_replace_all("male", "male,") %>% 
  
  # Read as comma delimited file (works since string contains \n)
  readr::read_delim(
    file = .,
    delim = ",",
    col_names = FALSE,
    show_col_types = FALSE
  )

# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
  "family_name",
  "first_names",
  "gender",
  "birthday",
  "birth_city",
  "birth_country",
  "acc_type",
  "acc_num",
  "district",
  "code"
)

df
#> # A tibble: 4 × 10
#>   family_name first_names      gender birthday birth_city birth_country acc_type
#>   <chr>       <chr>            <chr>  <chr>    <chr>      <chr>         <chr>   
#> 1 EREKSON     Andrew Hélène    female 10/06/2… Geneva     Switzerland   PPF     
#> 2 BOUKAR      Mohamed El-Hadi  male   04/12/1… London     England       PPF     
#> 3 HARIMA      Olak N’nassik G… female 25/06/2… Paris      France        PPF     
#> 4 THOMAS      Hajil Pau Joëli  female 03/03/1… Berlin     Germany       VAT     
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>

Created on 2022-03-20 by the reprex package (v2.0.1)

Note that there are probably more efficient regexes one could use, but I believe this approach is simpler and easier to change later.
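
As one example of a possible tightening, the step that adds a comma after female/male could be restricted to a "male" that is directly followed by a digit. This is only a base R sketch on a made-up, partly cleaned record: the first name "Jamale" is hypothetical and does not appear in the question's data.

```r
# Hypothetical, partly cleaned record; "Jamale" is invented to show the edge case.
rec <- "THOMAS,Hajil Pau Jamale,female03/03/1980,Berlin"

# Plain replacement also hits the "male" inside the first name.
naive <- gsub("male", "male,", rec, fixed = TRUE)
naive
# "THOMAS,Hajil Pau Jamale,,female,03/03/1980,Berlin"

# A lookahead limits the replacement to "male" directly followed by a digit,
# which is where the gender field ends and the birthday begins.
guarded <- gsub("male(?=\\d)", "male,", rec, perl = TRUE)
guarded
# "THOMAS,Hajil Pau Jamale,female,03/03/1980,Berlin"
```

With the sample data in the question both versions give the same result, so the guarded pattern only matters if such names can occur in the full data.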

edited Mar 20, 2022 at 15:00

answered Mar 19, 2022 at 15:25

jpiversen

10 Comments

Varun Over a year ago

Thank you! I noticed that the splitting of first names isn't right when the first names contain special characters like 'è' , '’' , '-' , 'è' , 'ï' , ''' and so on

Varun Over a year ago

I have slightly changed the input 'txt' to consider a case when an individual has 3 first names (in some cases even 4 first names exist in the original dataframe). How does the code adapt to take multiple first names into account?

jpiversen Over a year ago

Not much would have to be changed, just (1) the regex that locates the first/middle names inside the parentheses from only accepting English letters to accepting all words, and (2) the str_replace() with str_replace_all() to allow for several middle names. I have updated my answer with code that works with your new data.

Varun Over a year ago

This looks great, now multiple first names with special characters seem to be working well! I noticed just a few cases where it still doesn't work. I've updated the question to reflect them. Any idea why these aren't working but the rest are?

Varun Over a year ago

Understood, I need to brush up on my regex.. Thanks so much! Accepting your answer.
