How to transform a written sequence into a numeric sequence? (R)

Question

I'm having trouble with a dataset provided by the Brazilian government (which is why it is in Portuguese). Here's the code that imports it:

library(tidyverse)
locais_vot_SP <- read_delim("https://raw.githubusercontent.com/camilagonc/votacao_secao/master/locais_vot_SP.csv",
                        locale = locale(encoding = "ISO-8859-1"),
                        delim = ",",
                        col_names = F) %>% 
              filter(X4 == "VINHEDO")

names(locais_vot_SP) <- c("num_zona", 
                      "nome_local",
                      "endereco",
                      "nome_municipio",
                      "secoes",
                      "secoes_esp")

As you can see, the values of the variable secoes are not tidy: several section numbers are packed into a single cell.

secoes
196ª; 207ª; 221ª; 231ª;
197ª; 211ª; 230ª; 249ª;

With the following code, I started to fix the problem:

locais_vot_SP <- locais_vot_SP %>% mutate(secoes = gsub("ª", "", secoes)) %>% 
                                   mutate(secoes_esp = gsub("ª", "", secoes_esp)) %>%
                                   mutate(secoes_esp = gsub(";", "", secoes_esp)) %>%
                                   mutate(secoes = gsub("Da ", "", secoes)) %>% 
                                   separate_rows(secoes, sep = ";") %>%  
                                   mutate(secoes = unlist(strsplit(locais_vot_SP$secoes, ";")))

And so I got to this:

secoes
32 à 38
100
121

What still needs to be solved are the cases of the form x à y (in English, x to y), which should expand into every number in the range. How can I get the following output?

secoes
32
33
34
35
36
37
38
...

jasbner · Accepted Answer · 2018-04-25 20:28:14Z

I kept your basic workflow but used gsubfn, which lets you apply a function to the groups captured by a regular expression. Here it captures the two endpoints of each range and replaces the match with the full sequence between them.

library(gsubfn)
locais_vot_SP <- locais_vot_SP %>% 
  # expand each "xª à y" range into "xª;(x+1)ª;...;y" before stripping symbols
  mutate(secoes = gsubfn("(\\d+)ª à (\\d+)",
                         function(x, y) paste0(seq(as.numeric(x), as.numeric(y)), collapse = "ª;"),
                         secoes)) %>% 
  mutate(secoes = gsub("ª", "", secoes)) %>% 
  mutate(secoes_esp = gsub("ª", "", secoes_esp)) %>%
  mutate(secoes_esp = gsub(";", "", secoes_esp)) %>%
  mutate(secoes = gsub("Da ", "", secoes)) %>% 
  mutate(secoes = gsub(" ", "", secoes)) %>%      # drop leftover spaces
  mutate(secoes = gsub(";$", "", secoes)) %>%     # drop the trailing ";"
  separate_rows(secoes, sep = ";")                # one row per section number
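
If you want to check the range-expansion idea in isolation, here is a minimal base R sketch on a made-up cell. The helpers expand_token and expand_cell are hypothetical names for illustration, and the input string is an assumption about what a raw secoes value looks like, not a row taken from the real file.

```r
# Hypothetical helper: turn one token such as "Da 32ª à 38ª" or " 100ª"
# into the integers it stands for.
expand_token <- function(token) {
  # pull out every run of digits, ignoring "Da", "ª", "à" and spaces
  nums <- as.integer(regmatches(token, gregexpr("[0-9]+", token))[[1]])
  if (length(nums) == 0) return(integer(0))  # token without any number
  # a single number gives seq(n, n), i.e. just n
  seq(nums[1], nums[length(nums)])
}

# Hypothetical helper: split one cell on ";" and expand every token.
expand_cell <- function(cell) {
  tokens <- strsplit(cell, ";", fixed = TRUE)[[1]]
  unlist(lapply(tokens, expand_token))
}

# Made-up cell in the style of the secoes column
cell <- "Da 32ª à 38ª; 100ª; 121ª;"
expand_cell(cell)
```

This avoids matching on the "ª" and "à" characters at all, which may be convenient if the file encoding is causing trouble.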

Robin Gertenbach · Accepted Answer · 2018-04-25 20:21:26Z

By creating a range you will change the length of the column. Since you seem to care only about that column, it's easiest to do this:

map(
    locais_vot_SP$secoes,
    ~seq(
      as.numeric(str_extract(., "^(\\d+)")),
      as.numeric(str_extract(., "(\\d+)$")))) %>% 
  reduce(c)

Or, if you need the result as a one-column data frame, continue your pipeline with %>% pull(secoes) %>% map(...) %>% reduce(c) %>% data.frame(secoes = .).

If there are other columns you need to keep, you can continue the pipeline with

%>%
  mutate(secoes = map(...)) %>%
  unnest(secoes)

to flatten on secoes, giving one row per section number.
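
As a rough, self-contained illustration of the mutate/unnest variant, here is a sketch on a toy tibble. The place names and values are invented, and it assumes secoes has already been cleaned to forms like "32 à 38" or "100" with no leading or trailing spaces (otherwise the anchored patterns return NA).

```r
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

# Toy data (made up): one cleaned token per row, plus a column to keep
toy <- tibble(
  nome_local = c("Escola A", "Escola B", "Escola C"),
  secoes     = c("32 à 38", "100", "121")
)

toy_long <- toy %>%
  # build a list-column: each element is the full sequence for that row
  mutate(secoes = map(
    secoes,
    ~ seq(
      as.numeric(str_extract(.x, "^\\d+")),  # first number in the token
      as.numeric(str_extract(.x, "\\d+$"))   # last number in the token
    )
  )) %>%
  # one row per section number; nome_local is repeated as needed
  unnest(secoes)

toy_long
```

A token with a single number has the same first and last match, so seq() returns just that number and the row is left as it was.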
