1

I know about "string"[/regex/], which returns the part of the string that matches. But what if I want to return only the captured part(s) of a string?

I have the string "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3". I want to store in the variable title the text The_Case_of_the_Gold_Ring.

I can capture this part with the regex /\d_(?!.*\d_)(.*).mp3$/i. But writing the Ruby "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"[/\d_(?!.*\d_)(.*).mp3$/i] returns 0_The_Case_of_the_Gold_Ring.mp3 which isn't what I want.

I can get what I want by writing

"1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" =~ /\d_(?!.*\d_)(.*).mp3$/i
title = $~.captures[0]

But this seems sloppy. Surely there's a proper way to do this?

(I'm aware that someone can probably write a simpler regex that targets the text I want and lets the "string"[/regex/] method work, but this is just an example to illustrate the problem; the specific regex isn't the issue.)

5 Answers

5

You can pass the index of the capture group as a second argument, string[/regexp/, index]:

string = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
string[/\d_(?!.*\d_)(.*).mp3$/i, 1]
# => "The_Case_of_the_Gold_Ring"
string[/\d_(?!.*\d_)(.*).mp3$/i, 0]
# => "0_The_Case_of_the_Gold_Ring.mp3"
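
As a minimal sketch of the same idea, the second argument can also be the name of a named group, and the call returns nil when nothing matches. The variable names below are only illustrative, and the dot before mp3 is escaped here so that it matches a literal period:

```ruby
string = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"

# Index 1 selects the first capture group.
title = string[/\d_(?!.*\d_)(.*)\.mp3$/i, 1]

# A named group can be selected by passing its name as a string.
named_title = string[/\d_(?!.*\d_)(?<title>.*)\.mp3$/i, "title"]

# When the pattern does not match, String#[] returns nil instead of raising.
missing = "no_match_here.txt"[/\d_(.*)\.mp3$/i, 1]

puts title        # The_Case_of_the_Gold_Ring
puts named_title  # The_Case_of_the_Gold_Ring
p missing         # nil
```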

2

Have a look at the match method:

string = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
regexp = /\d_(?!.*\d_)(.*).mp3$/i

matches = regexp.match(string)
matches[1]
#=> "The_Case_of_the_Gold_Ring"

Here matches[0] returns the whole match, and matches[1] onward return the capture groups in order:

matches.to_a    
#=> ["0_The_Case_of_the_Gold_Ring.mp3", "The_Case_of_the_Gold_Ring"]

Read more examples: http://ruby-doc.org/core-2.1.4/MatchData.html#method-i-5B-5D
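
One thing to keep in mind is that match returns nil when the pattern does not match, so indexing the result directly can raise NoMethodError. A small sketch of one way to guard against that is the block form of match, which only runs the block on success. The second input string is a made-up example, and the dot before mp3 is escaped here:

```ruby
string = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
regexp = /\d_(?!.*\d_)(.*)\.mp3$/i

# The block receives the MatchData only when the match succeeds;
# its return value becomes the return value of match.
title = regexp.match(string) { |m| m[1] }

# With no match, the block is skipped and match returns nil.
no_title = regexp.match("notes.txt") { |m| m[1] }

puts title   # The_Case_of_the_Gold_Ring
p no_title   # nil
```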


1

You can use named captures:

"1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" =~ /\d_(?!.*\d_)(?<title>.*).mp3$/i

and $~[:title] will give you what you want.
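
A related sketch: when a regexp literal with named groups (and no interpolation) is written on the left-hand side of =~, Ruby assigns each named capture to a local variable of the same name, which stores the text in title directly. The dot before mp3 is escaped here:

```ruby
string = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"

# The regexp literal must be on the left of =~ for the assignment to happen.
/\d_(?!.*\d_)(?<title>.*)\.mp3$/i =~ string

puts title  # The_Case_of_the_Gold_Ring
```

If the pattern does not match, title is set to nil.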


1

Meditate on this:

Here's the source string to be parsed:

str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"

Patterns can be defined as strings:

DATE_REGEX = '\d{4}-[A-Z]{3}-\d{2}'
SERIAL_REGEX = '\d{2}'
TITLE_REGEX = '.+'

Then interpolated into a regexp:

regex = /^(#{ DATE_REGEX })_(#{ SERIAL_REGEX })_(#{ TITLE_REGEX })/
# => /^(\d{4}-[A-Z]{3}-\d{2})_(\d{2})_(.+)/

The advantage is that the regexp is easier to maintain, because it is built from several smaller patterns.

str.match(regex) # => #<MatchData "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" 1:"1952-FEB-21" 2:"70" 3:"The_Case_of_the_Gold_Ring.mp3">
regex.match(str) # => #<MatchData "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" 1:"1952-FEB-21" 2:"70" 3:"The_Case_of_the_Gold_Ring.mp3">

are equivalent because both Regexp and String implement match.

We can retrieve what was captured as an array:

regex.match(str).captures # => ["1952-FEB-21", "70", "The_Case_of_the_Gold_Ring.mp3"]
regex.match(str).captures.last # => "The_Case_of_the_Gold_Ring.mp3"

We can also name the captures and access them like we would a hash:

regex = /^(?<date>#{ DATE_REGEX })_(?<serial>#{ SERIAL_REGEX })_(?<title>#{ TITLE_REGEX })/
matches = regex.match(str)
matches[:date] # => "1952-FEB-21"
matches[:serial] # => "70"
matches[:title] # => "The_Case_of_the_Gold_Ring.mp3"
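
Note that the title captured above still includes the ".mp3" extension. As a rough sketch, assuming the file name always ends in ".mp3", the extension can be matched outside the title group, and MatchData#named_captures returns all fields as a hash. The lowercase pattern variables are illustrative stand-ins for the constants above:

```ruby
str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"

# Illustrative sub-patterns, kept as single-quoted strings so the
# backslashes reach the regexp unchanged.
date_pattern   = '\d{4}-[A-Z]{3}-\d{2}'
serial_pattern = '\d{2}'
title_pattern  = '.+'

# The extension is matched after the title group, so it is not captured.
regex = /\A(?<date>#{date_pattern})_(?<serial>#{serial_pattern})_(?<title>#{title_pattern})\.mp3\z/

fields = regex.match(str).named_captures
puts fields["title"]  # The_Case_of_the_Gold_Ring
```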

Of course, it's not necessary to mess with that rigamarole at all. We can split the string on underscores ('_'):

str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
str.split('_') # => ["1952-FEB-21", "70", "The", "Case", "of", "the", "Gold", "Ring.mp3"]

split can take a limit parameter giving the maximum number of pieces to return; everything after the last allowed split stays in the final piece. Passing in 3 gives us:

str.split('_', 3) # => ["1952-FEB-21", "70", "The_Case_of_the_Gold_Ring.mp3"]

Grabbing the last element returns:

str.split('_', 3).last # => "The_Case_of_the_Gold_Ring.mp3"
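
The result above still carries the ".mp3" extension. One possible way to drop it, sketched here under the assumption that the extension is always ".mp3", is to strip it with File.basename before splitting:

```ruby
str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"

# File.basename removes the given suffix when the name ends with it.
base = File.basename(str, ".mp3")

# Limit of 3 keeps the remaining underscores inside the last piece.
title = base.split("_", 3).last

puts title  # The_Case_of_the_Gold_Ring
```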


0

I believe it would be easiest to use a capture group here, but for illustrative purposes I'd like to present some possibilities that do not. All employ the same positive lookahead, (?=\.mp3$). All but one use a positive lookbehind; the other uses \K to "forget" everything matched before the start of the desired text. Some permit the matched string to contain digits (.+); others do not ([^\d]+).

str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"

1 # match follows last digit followed by underscore, cannot contain digits 
str[/(?<=\d_)[^\d]+(?=\.mp3$)/]    
  #=> "The_Case_of_the_Gold_Ring"

2 # same as 1, as `\K` disregards match to that point
str[/\d_\K[^\d]+(?=\.mp3$)/]
  #=> "The_Case_of_the_Gold_Ring"

3 # match follows underscore, two digits, underscore, may contain digits 
str[/(?<=_\d\d_).+(?=\.mp3$)/]
  #=> "The_Case_of_the_Gold_Ring"

4 # match follows string having specific pattern, may contain digits
str[/(?<=\d{4}-[A-Z]{3}-\d{2}_\d{2}_).+(?=\.mp3$)/]
  #=> "The_Case_of_the_Gold_Ring"

5 # match follows digit, any 12 characters, another digit and underscore,
  # may contain digits
str[/(?<=\d.{12}\d_).+(?=\.mp3$)/]
  #=> "The_Case_of_the_Gold_Ring"
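
To illustrate the difference between the variants that allow digits and those that do not, here is a small sketch using a made-up file name whose title contains a digit. The variable names are illustrative only:

```ruby
# Hypothetical file name whose title contains a digit.
tricky = "1952-FEB-21_70_The_3_Bears.mp3"

# Variant 1 excludes digits, so it only finds the text after the last
# digit-underscore pair in the title.
no_digits = tricky[/(?<=\d_)[^\d]+(?=\.mp3$)/]

# Variant 3 allows digits and keeps the whole title.
with_digits = tricky[/(?<=_\d\d_).+(?=\.mp3$)/]

puts no_digits    # Bears
puts with_digits  # The_3_Bears
```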
