
Consider the following example:

dataframe_test <- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))

# A tibble: 2 x 1
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)

> sdf
# Source:   table<dataframe_test> [?? x 1]
# Database: spark_connection
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

I would like to modify the character timestamp so that it has a more conventional format. I tried to do so using regexp_replace, but it fails:

> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source:   lazy query [?? x 2]
# Database: spark_connection
                    mydate                    regex
                     <chr>                    <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z

Any ideas? What is the correct syntax?

  • The pattern is correct (you could use a literal . in place of the wildcard); you're just using the wrong function. Commented Jun 20, 2017 at 17:56
  • Wait a sec, please: which function should I use? Your link actually specifies the same function I use. Commented Jun 20, 2017 at 18:00
  • Take a closer look: it is regexp_replace, not regexp_extract :) Commented Jun 20, 2017 at 18:01
  • I believe this is still a duplicate; I was just wrong about the pattern. Please note that it has to match a whole string, and you didn't escape everything: sdf %>% mutate(regex = regexp_replace(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', '$1-$2-$3 $4:$5:$6.$7')). You could use regexp_extract, but it would require enumerating all fields: sdf %>% mutate(regex = regexp_extract(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', 1)) Commented Jun 20, 2017 at 18:12
  • You have to escape once for R and once for Java, I am afraid. If you think this should be a separate answer, I can reopen it. Commented Jun 20, 2017 at 18:18

2 Answers


Spark SQL and Hive provide two different functions:

  • regexp_extract - which takes a string, a pattern, and the index of the group to be extracted.
  • regexp_replace - which takes a string, a pattern, and the replacement string.

The former can be used to extract a single group, with the index semantics being the same as for java.util.regex.Matcher.

For regexp_replace the pattern has to match the whole string; if there is no match, the input string is returned unchanged:

sdf %>% mutate(
  regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
  regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))

## Source:   query [2 x 3]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 3
##                     mydate regex               regexp_bad
##                      <chr> <chr>                    <chr>
## 1 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z
## 2 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z

while with regexp_extract it is not required:

sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))

## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate regex
##                      <chr> <chr>
## 1 2011-03-01T00:00:04.226Z  2011
## 2 2011-03-01T00:00:04.226Z  2011

Also, due to indirect execution (R -> Java), you have to escape twice:

sdf %>% mutate(
  regex = regexp_replace(
    mydate, 
    '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
    '$1-$2-$3 $4:$5:$6.$7'))
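A quick way to sanity-check the pattern itself, before worrying about the double escaping, is to run it through any regex engine with a single level of escaping. This is not sparklyr code, just an illustrative check in Python's standard re module (note Python writes backreferences as \1 where Java/Spark uses $1):

```python
import re

# Single level of escaping: this raw-string pattern is exactly what the
# JVM receives after R unescapes '\\\\d' down to '\\d'.
pattern = r'^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})\.(\d{3})Z$'
replacement = r'\1-\2-\3 \4:\5:\6.\7'  # Python backreferences, not $1..$7

print(re.sub(pattern, replacement, '2011-03-01T00:00:04.226Z'))
# 2011-03-01 00:00:04.226
```

If the pattern behaves here, the only remaining step on the sparklyr side is doubling every backslash.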

Normally one would use Spark datetime functions:

spark_session(sc) %>%  
  invoke("sql",
    "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed from dataframe_test") %>% 
  sdf_register


## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate                  parsed
##                      <chr>                   <chr>
## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226

but sadly sparklyr seems to be extremely limited in this area, and treats timestamps as strings.
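If the data can be reformatted before it reaches Spark, the same conversion is straightforward with ordinary datetime parsing. A minimal sketch in Python (a stand-in for doing it client-side; note it keeps the wall-clock UTC value, whereas the CAST above shifted it to the session time zone):

```python
from datetime import datetime

def reformat_iso(s):
    # Parse the ISO-8601 string (the trailing 'Z' marks UTC) and re-emit
    # it in the conventional 'yyyy-MM-dd HH:mm:ss.SSS' form.
    dt = datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%fZ')
    # %f yields microseconds; truncate back to the 3 millisecond digits.
    return dt.strftime('%Y-%m-%d %H:%M:%S.') + f'{dt.microsecond // 1000:03d}'

print(reformat_iso('2011-03-01T00:00:04.226Z'))
# 2011-03-01 00:00:04.226
```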

See also change string in DF using hive command and mutate with sparklyr.


Comment: really interesting solution

I had some difficulty replacing "." with "", but it finally worked with:

mutate(myvar2 = regexp_replace(myvar, "[.]", ""))
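The character class is what makes this work: in a regex a bare . matches any character, so replacing "." would delete the entire string; [.] (or a properly escaped dot) matches only a literal period. The effect is easy to verify in any regex engine; a quick illustration in Python's re module (not sparklyr code, just the same regex semantics):

```python
import re

s = '1.23'
print(re.sub('.', '', s))    # bare dot matches every character -> ''
print(re.sub('[.]', '', s))  # character class matches a literal dot -> '123'
print(re.sub(r'\.', '', s))  # escaped dot, equivalent -> '123'
```

In sparklyr the escaped-dot form would need the usual doubling for the R-to-Java round trip, i.e. "\\\\.", which is why the character class is the less error-prone choice.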

