regex in R to extract value between two strings

Question

I have lines that look like this:

 01:04:43.064 [12439] <2> xyz
 01:04:43.067 [12439] <2> a lmn
 01:04:43.068 [12439] <4> j klm
 x_times_wait to <3000>
 01:04:43.068 [12439] <4> j klm
 enter_object <5000> main k

I want a regex that extracts only the values after the angle brackets, and only for lines that start with a timestamp.

This is what I have tried, assuming that these lines are in a data frame called nn:

 library(stringr)  # provides str_split_fixed()
 split <- str_split_fixed(nn[,1], ">", 2)
 split2 <- data.frame(split[,2])

The problem is that split2 gives:

   xyz
   a lmn
   j klm

   j klm
   main k

How can I make sure that the empty line and "main k" are not returned?

Solved using the stringr package: aa<-str_extract(as.character(nn[,1]), "[0-9][0-9]:.*"), followed by str_split_fixed. (Comment by user3707934, Dec 18, 2014 at 21:25)

vks · Accepted Answer · 2014-12-18 18:21:23Z

3

\d+(?::\d+){2}\.\d+\s+\[[^\]]+\]\s+<\d+>(.+)$

Instead of splitting, try matching and take capture group 1. See the demo:

https://regex101.com/r/vN3sH3/16

or

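Split on the lookbehind (?<=<\d>), which only matches right after a single-digit tag such as <2> or <4> (so not after <3000> or <5000>), and take the second piece (your split2).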
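
The match-and-capture suggestion could be sketched in base R roughly as follows. This is only an illustration, not the answerer's own code: the names x, pattern, m, hits, and values are made up here, the backslashes are doubled because the pattern sits inside an R string, and it assumes an R version in which regexec() accepts perl = TRUE.

```r
# Sample lines copied from the question
x <- c("01:04:43.064 [12439] <2> xyz",
       "01:04:43.067 [12439] <2> a lmn",
       "01:04:43.068 [12439] <4> j klm",
       "x_times_wait to <3000>",
       "01:04:43.068 [12439] <4> j klm",
       "enter_object <5000> main k")

# The pattern from this answer, escaped for an R string literal
pattern <- "\\d+(?::\\d+){2}\\.\\d+\\s+\\[[^\\]]+\\]\\s+<\\d+>(.+)$"

# regexec() reports the full match plus each capture group;
# elements without a match come back as character(0)
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Keep only the lines that matched, take group 1 (second element),
# and trim the space that follows the closing bracket
hits <- m[lengths(m) > 0]
values <- trimws(vapply(hits, `[`, character(1), 2))
values
# [1] "xyz"   "a lmn" "j klm" "j klm"
```

Because non-matching lines yield no groups at all, neither the empty entry nor "main k" appears in the result.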

answered Dec 18, 2014 at 18:21


Rich Scriven · Accepted Answer · 2014-12-19 00:49:41Z

2

If a timestamp is defined as 1 or more digits followed by a :, followed by 1 or more digits and another : and then 1 or more digits, then perhaps this method would work for you.

x <- c("01:04:43.064 [12439] <2> xyz", "01:04:43.067 [12439] <2> a lmn",   
       "01:04:43.068 [12439] <4> j klm", "x_times_wait to <3000>",  
       "01:04:43.068 [12439] <4> j klm", "enter_object <5000> main k")

sub(".*> ", "", x[grepl("\\d+:\\d+:\\d+", x)])
# [1] "xyz"   "a lmn" "j klm" "j klm"

This first drops all elements that do not contain a timestamp, then strips everything up to and including "> " from the remaining elements, leaving only the values.
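
One caveat: the grepl() pattern is not anchored, so a line that merely contains something timestamp-like would also be kept. A possible variation, sketched under the assumption that timestamps always have the form hh:mm:ss.mmm at the start of the line, is shown below. The data frame nn here is a hypothetical stand-in for the asker's, and its last row is invented purely to show the difference.

```r
# Hypothetical stand-in for the asker's data frame nn; the last row is
# made up to show a timestamp that is NOT at the start of the line
nn <- data.frame(V1 = c("01:04:43.064 [12439] <2> xyz",
                        "x_times_wait to <3000>",
                        "enter_object <5000> main k",
                        "retry at 01:04:43 <7> not wanted"),
                 stringsAsFactors = FALSE)

lines <- as.character(nn[, 1])

# Unanchored test: also keeps the made-up last row
loose <- sub(".*> ", "", lines[grepl("\\d+:\\d+:\\d+", lines)])
loose
# [1] "xyz"        "not wanted"

# Anchored test: ^ restricts it to lines that begin with hh:mm:ss.mmm
has_ts <- grepl("^\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", lines)
strict <- sub(".*> ", "", lines[has_ts])
strict
# [1] "xyz"
```

Whether the stricter pattern is appropriate depends on how uniform the real log format is.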

edited Dec 19, 2014 at 0:49

answered Dec 18, 2014 at 18:31


Sven Hohenstein · Accepted Answer · 2014-12-18 19:07:33Z

0

Here's an approach in base R:

The regex:

^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+

You can use it with gregexpr:

unlist(regmatches(vec, gregexpr("^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+", 
                                vec, perl = TRUE)))
# [1] "xyz"   "a lmn" "j klm" "j klm"

where vec is the vector containing your strings.
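
If the result needs to stay aligned with the original rows (for example, to store it as a new column in a data frame), one option is regexpr() instead of gregexpr(), filling non-matching positions with NA. This is a sketch of that variation, not part of the original answer; the name out is illustrative and vec is filled with the sample lines from the question.

```r
# Sample lines from the question
vec <- c("01:04:43.064 [12439] <2> xyz",
         "01:04:43.067 [12439] <2> a lmn",
         "01:04:43.068 [12439] <4> j klm",
         "x_times_wait to <3000>",
         "01:04:43.068 [12439] <4> j klm",
         "enter_object <5000> main k")

# Same pattern as above: \K discards everything matched so far,
# so only the text after the last ">" is reported as the match
pattern <- "^(\\d{2}:){2}\\d{2}\\.\\d{3}.*>\\s*\\K.+"

# regexpr() returns one match position per element, or -1 for no match
m <- regexpr(pattern, vec, perl = TRUE)

# Start with NA everywhere, then fill in the positions that matched
out <- rep(NA_character_, length(vec))
out[m != -1] <- regmatches(vec, m)
out
# [1] "xyz"   "a lmn" "j klm" NA      "j klm" NA
```

The NA entries mark the lines without a leading timestamp, so out has the same length as vec.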

answered Dec 18, 2014 at 19:07


Jim · Accepted Answer · 2014-12-19 14:49:40Z

0

Using rex may make this type of task a little simpler.

string <- "01:04:43.064 [12439] <2> xyz
01:04:43.067 [12439] <2> a lmn
01:04:43.068 [12439] <4> j klm
x_times_wait to <3000>
01:04:43.068 [12439] <4> j klm
enter_object <5000> main k"

library(rex)

timestamp <- rex(n(digit, 2), ":", n(digit, 2), ":", n(digit, 2), ".", n(digit, 3))

re <- rex(timestamp, space,
          "[", digits, "]", space,
          "<", digits, ">", space,
          capture(anything))

re_matches(string, re, global = TRUE)

#> [[1]]
#>       1
#> 1   xyz
#> 2 a lmn
#> 3 j klm
#> 4 j klm

answered Dec 19, 2014 at 14:49
