
Consider the following example:

dataframe_test <- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))

# A tibble: 2 x 1
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)

> sdf
# Source:   table<dataframe_test> [?? x 1]
# Database: spark_connection
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

I would like to modify the character timestamp so that it has a more conventional format. I tried to do so using regexp_replace, but it fails:

> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source:   lazy query [?? x 2]
# Database: spark_connection
                    mydate                    regex
                     <chr>                    <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z

Any ideas? What is the correct syntax?

  • The pattern is correct (you could use a literal . in place of the wildcard); you're just using the wrong function. Commented Jun 20, 2017 at 17:56
  • Wait a sec, please: which function should I use? Your link actually specifies the same function I use. Commented Jun 20, 2017 at 18:00
  • Take a closer look: it is regexp_replace, not regexp_extract :) Commented Jun 20, 2017 at 18:01
  • I believe this is still a duplicate; I was just wrong about the pattern. Please note that it has to match a whole string, and you didn't escape everything: sdf %>% mutate(regex = regexp_replace(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', '$1-$2-$3 $4:$5:$6.$7')). You could use regexp_extract, but it would require enumerating all fields: sdf %>% mutate(regex = regexp_extract(mydate, '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$', 1)) Commented Jun 20, 2017 at 18:12
  • You have to escape once for R and once for Java, I am afraid. If you think this should be a separate answer, I can reopen it. Commented Jun 20, 2017 at 18:18

2 Answers


Spark SQL and Hive provide two different functions:

  • regexp_extract - which takes a string, a pattern, and the index of the group to be extracted.
  • regexp_replace - which takes a string, a pattern, and the replacement string.

The former can be used to extract a single group, with the index semantics being the same as for java.util.regex.Matcher.

For regexp_replace the pattern has to match the whole string; if there is no match, the input string is returned unchanged:

sdf %>% mutate(
  regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
  regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))

## Source:   query [2 x 3]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 3
##                     mydate regex               regexp_bad
##                      <chr> <chr>                    <chr>
## 1 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z
## 2 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z

while with regexp_extract it is not required:

sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))

## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate regex
##                      <chr> <chr>
## 1 2011-03-01T00:00:04.226Z  2011
## 2 2011-03-01T00:00:04.226Z  2011

Also, due to indirect execution (R -> Java), you have to escape twice:

sdf %>% mutate(
  regex = regexp_replace(
    mydate, 
    '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
    '$1-$2-$3 $4:$5:$6.$7'))
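A quick way to sanity-check the pattern itself, before worrying about the double escaping, is to run it through any regex engine with a single level of escaping. This is not sparklyr code, just an illustrative check in Python's standard re module (note Python writes backreferences as \1 where Java/Spark uses $1):

```python
import re

# Single level of escaping: this raw-string pattern is exactly what the
# JVM receives after R unescapes '\\\\d' down to '\\d'.
pattern = r'^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})\.(\d{3})Z$'
replacement = r'\1-\2-\3 \4:\5:\6.\7'  # Python backreferences, not $1..$7

print(re.sub(pattern, replacement, '2011-03-01T00:00:04.226Z'))
# 2011-03-01 00:00:04.226
```

If the pattern behaves here, the only remaining step on the sparklyr side is doubling every backslash.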

Normally one would use Spark datetime functions:

spark_session(sc) %>%  
  invoke("sql",
    "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed from dataframe_test") %>% 
  sdf_register


## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate                  parsed
##                      <chr>                   <chr>
## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226

but sadly sparklyr seems to be extremely limited in this area, and treats timestamps as strings.
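If the data can be reformatted before it reaches Spark, the same conversion is straightforward with ordinary datetime parsing. A minimal sketch in Python (a stand-in for doing it client-side; note it keeps the wall-clock UTC value, whereas the CAST above shifted it to the session time zone):

```python
from datetime import datetime

def reformat_iso(s):
    # Parse the ISO-8601 string (the trailing 'Z' marks UTC) and re-emit
    # it in the conventional 'yyyy-MM-dd HH:mm:ss.SSS' form.
    dt = datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%fZ')
    # %f yields microseconds; truncate back to the 3 millisecond digits.
    return dt.strftime('%Y-%m-%d %H:%M:%S.') + f'{dt.microsecond // 1000:03d}'

print(reformat_iso('2011-03-01T00:00:04.226Z'))
# 2011-03-01 00:00:04.226
```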

See also change string in DF using hive command and mutate with sparklyr.


Comment: really interesting solution

I had some difficulty replacing "." with "", but it finally worked with:

mutate(myvar2 = regexp_replace(myvar, "[.]", ""))
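The character class is what makes this work: in a regex a bare . matches any character, so replacing "." would delete the entire string; [.] (or a properly escaped dot) matches only a literal period. The effect is easy to verify in any regex engine; a quick illustration in Python's re module (not sparklyr code, just the same regex semantics):

```python
import re

s = '1.23'
print(re.sub('.', '', s))    # bare dot matches every character -> ''
print(re.sub('[.]', '', s))  # character class matches a literal dot -> '123'
print(re.sub(r'\.', '', s))  # escaped dot, equivalent -> '123'
```

In sparklyr the escaped-dot form would need the usual doubling for the R-to-Java round trip, i.e. "\\\\.", which is why the character class is the less error-prone choice.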

