
I heard that URI::extract() only returns links that contain a ":". The tweet I am grabbing has a link without one, so I believe I have to use a regular expression. I need to find a "swoo.sh/whatever" link and store it in a variable. How can I match the first such link while keeping everything that comes after the /? For example, if the tweet says

Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum

How would I grab the swoo.sh link, including everything that comes directly after the /?

  • Is swoo.sh fixed? Commented May 9, 2018 at 4:18
  • I would assume such links are clickable on twitter, which means the original HTML would have the actual URI in it, making this task trivial. Are you sure you can't use a different API/scraper to get the actual HTML content of the tweet? Commented May 9, 2018 at 13:55

2 Answers


Here is one approach using match:

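# Pattern: word characters, a dot, word characters, a slash, word characters.
match = /(\w+\.\w+\/\w+)/.match("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
if match
  puts match[1]  # prints "swoo.sh/12xfsW"
else
  puts "no match"
end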


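If you also need to capture full URLs (with a scheme such as http://), this answer would have to be extended. Note also that \w+ stops at characters such as - or ?, so only the word-character part of the path is captured. This only answers your immediate question.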
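If the host really is always swoo.sh (the question does not confirm this, so treat it as an assumption), a narrower pattern may be enough. The sketch below anchors on the literal host and takes every non-whitespace character after the slash. The names SWOOSH_LINK and first_swoosh_link are made up for this illustration.

```ruby
# Assumption: the link always uses the literal host "swoo.sh".
# \S+ keeps everything after the slash up to the next whitespace.
SWOOSH_LINK = %r{\bswoo\.sh/\S+}

# Hypothetical helper: returns the first swoo.sh link in the text, or nil.
def first_swoosh_link(text)
  text[SWOOSH_LINK]
end

tweet = "Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum"
link = first_swoosh_link(tweet)
puts link || "no match"  # prints "swoo.sh/12xfsW"
```

Because \S+ is greedy, trailing punctuation such as a full stop directly after the link would be captured too, so you may want to strip that separately.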


1 Comment

Thank you. You answering my immediate question was all that was needed!

We can use the fact that URIs can't contain spaces, and that Ruby's URI.parse accepts almost anything that looks URI-ish, returning a URI::Generic when there is no scheme. Then we just need to filter out the non-web URIs, which I do by assuming that every web URI starts with something like foo.bar.

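require 'uri'
require 'pathname'

# `tweet` is the tweet text as a String.
# Split on whitespace, parse each token (nil when parsing fails), then keep
# URIs that have a hostname or whose first path segment looks like foo.bar.
tweet.
  split.
  map { |s| URI.parse(s) rescue nil }.
  select { |u| u && (u.hostname || Pathname(u.path).each_filename.first =~ /\w\.\w/) }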

Example output

tweet = 'foo . < google.com bar swoosh.sh/blah?q=bar http://google.com/bar'
# the above returns
# [#<URI::Generic google.com>, #<URI::Generic swoosh.sh/blah?q=bar>, #<URI::HTTP http://google.com/bar>]

This can't really work in general because of ambiguity. "car.net" looks like a shortened link, but in context it could be "my neighbor threw a baseball through my window so i yanked the hubcaps off his car.net gain!!!", where it's clearly just a missing space after the period.
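One way to sidestep that ambiguity, if you know in advance which shorteners you care about, is to accept only tokens whose host is on an allowlist. This is a different technique from the URI.parse filter above, shown as a minimal sketch: KNOWN_SHORTENERS and shortener_links are hypothetical names, and the list contains only the host from the question.

```ruby
require 'set'

# Hypothetical allowlist; add whichever shortener hosts you expect to see.
KNOWN_SHORTENERS = Set["swoo.sh"].freeze

# Returns the whitespace-separated tokens whose host is on the allowlist.
def shortener_links(tweet)
  tweet.split.select do |token|
    # Drop an optional http(s):// prefix, then take the part before the first slash.
    host = token.sub(%r{\Ahttps?://}i, "").split("/", 2).first
    KNOWN_SHORTENERS.include?(host.to_s.downcase)
  end
end

p shortener_links("yanked the hubcaps off his car.net gain!!! swoo.sh/12xfsW")
# prints ["swoo.sh/12xfsW"]
```

The trade-off is that links from any shortener missing from the list are silently ignored.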
