1

I'm having trouble with a dataset provided by the Brazilian government (so it is in Portuguese). Here's the code that imports it:

library(tidyverse)
locais_vot_SP <- read_delim("https://raw.githubusercontent.com/camilagonc/votacao_secao/master/locais_vot_SP.csv",
                        locale = locale(encoding = "ISO-8859-1"),
                        delim = ",",
                        col_names = F) %>% 
              filter(X4 == "VINHEDO")

names(locais_vot_SP) <- c("num_zona", 
                      "nome_local",
                      "endereco",
                      "nome_municipio",
                      "secoes",
                      "secoes_esp")

As you can see, the values of the variable secoes are not tidy: several values are packed into the same cell.

secoes
196ª; 207ª; 221ª; 231ª;
197ª; 211ª; 230ª; 249ª;

With the following code, I started to fix the problem:

locais_vot_SP <- locais_vot_SP %>% mutate(secoes = gsub("ª", "", secoes)) %>% 
                                   mutate(secoes_esp = gsub("ª", "", secoes_esp)) %>%
                                   mutate(secoes_esp = gsub(";", "", secoes_esp)) %>%
                                   mutate(secoes = gsub("Da ", "", secoes)) %>% 
                                   separate_rows(secoes, sep = ";") %>%  
                                   mutate(secoes = unlist(strsplit(locais_vot_SP$secoes, ";")))

And so I got to this:

secoes
32 à 38
100
121

What remains is handling the cases written as x à y (in English, x to y), which should expand to every number in the range. How can I get the following output?

secoes
32
33
34
35
36
37
38
...

2 Answers

1

I tried to keep your basic workflow, but used gsubfn to apply a function to the regular-expression match that captures the two endpoints of each range, so the range can be expanded into the full sequence.

library(gsubfn)
locais_vot_SP <- locais_vot_SP %>% 
  # expand every "x à y" range into "x;x+1;...;y", keeping "ª;" as the separator
  mutate(secoes = gsubfn("(\\d+)ª à (\\d+)",
                         function(x, y) paste0(seq(as.numeric(x), as.numeric(y)), collapse = "ª;"),
                         secoes)) %>% 
  # strip the ordinal marker, the "Da " prefix, spaces and the trailing ";"
  mutate(secoes = gsub("ª", "", secoes)) %>% 
  mutate(secoes_esp = gsub("ª", "", secoes_esp)) %>%
  mutate(secoes_esp = gsub(";", "", secoes_esp)) %>%
  mutate(secoes = gsub("Da ", "", secoes)) %>% 
  mutate(secoes = gsub(" ", "", secoes)) %>% 
  mutate(secoes = gsub(";$", "", secoes)) %>% 
  # one row per section number
  separate_rows(secoes, sep = ";")
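
To check the range-expansion step in isolation, here is a minimal sketch on a made-up string. The sample value of `x` and the names `from`, `to` and `expanded` are assumptions for illustration; only the regular expression and the gsubfn call come from the pipeline above.

```r
library(gsubfn)

# Made-up input in the same shape as one raw `secoes` cell (assumption).
x <- "32ª à 35ª; 100ª; 121ª;"

# The two capture groups are passed to the function as strings,
# so they are converted to numbers before building the sequence.
expanded <- gsubfn(
  "(\\d+)ª à (\\d+)",
  function(from, to) paste0(seq(as.numeric(from), as.numeric(to)), collapse = "ª;"),
  x
)

print(expanded)
```

Only the "32ª à 35" part is rewritten; the remaining entries pass through untouched and are cleaned by the later gsub steps.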

0

Creating a range changes the length of the column. Since you seem to care only about that column, it's easiest to do this:

map(
    locais_vot_SP$secoes,
    ~seq(
      as.numeric(str_extract(., "^(\\d+)")),
      as.numeric(str_extract(., "(\\d+)$")))) %>% 
  reduce(c)

Or continue your pipeline with %>% pull(secoes) %>% map(...) %>% reduce(c) %>% data.frame(secoes = .) if you need the result as a one-column data frame.

If there are other columns you want to keep, you can continue the pipeline with

%>%
  mutate(secoes = map(...)) %>%
  unnest(secoes)

to flatten the list column secoes back into one row per value.
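
As a rough end-to-end sketch of that last variant, the snippet below runs the map-then-unnest idea on a small made-up tibble. The data frame `toy`, its values, and the `str_trim()` call (added in case a value carries stray whitespace) are assumptions, not part of the original data or answer.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for the cleaned locais_vot_SP (assumption).
toy <- tibble(
  nome_local = c("Escola A", "Escola B", "Escola B"),
  secoes     = c("32 à 35", "100", "121")
)

result <- toy %>%
  # build one numeric sequence per row; a single number yields a length-1 sequence
  mutate(secoes = map(
    str_trim(secoes),
    ~ seq(
      as.numeric(str_extract(.x, "^\\d+")),
      as.numeric(str_extract(.x, "\\d+$"))
    )
  )) %>%
  # expand the list column so each section number gets its own row
  unnest(secoes)

print(result)
```

The other columns (here `nome_local`) are repeated for every number produced from the same original row.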
