I'm wondering if I can use a regex in BigQuery to extract all the numbers from a string.

I think the query below works, but it returns only the first match. Is there a way to extract all the matches?

My use case is that I want the biggest number in a URL, since that tends to be the post_id that I need to join on.

Here is an example of what I am talking about:

SELECT
  mystr,
  REGEXP_EXTRACT(mystr, r'(\d+)') AS nums
FROM
  (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
  (SELECT 'just one number 123 in this one ' AS mystr),
  (SELECT '99' AS mystr),
  (SELECT 'another -2 example 99' AS mystr),
  (SELECT 'another-8766 example 99' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)

The results I get from this are:

[
  {
    "mystr": "this is a string with some 666 numbers 999 in it 333",
    "nums": "666"
  },
  {
    "mystr": "just one number 123 in this one ",
    "nums": "123"
  },
  {
    "mystr": "99",
    "nums": "99"
  },
  {
    "mystr": "another -2 example 99",
    "nums": "2"
  },
  {
    "mystr": "another-8766 example 99",
    "nums": "8766"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999",
    "nums": "2015"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001",
    "nums": "2015"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview",
    "nums": "2015"
  }
]

2 Answers

After a bit of digging, I ended up with this solution:

SELECT
  mystr,
  GROUP_CONCAT(SPLIT(REGEXP_REPLACE(mystr, r'[^\d]+', ','))) AS nums
FROM
  (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
  (SELECT 'just one number 123 in this one ' AS mystr),
  (SELECT '99' AS mystr),
  (SELECT 'another -2 example 99' AS mystr),
  (SELECT 'another-8766 example 99' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)

How it works:

  • First, a regex replaces every run of non-digit characters with a comma
  • Then SPLIT breaks the string on the commas; empty results are discarded
  • GROUP_CONCAT is only there to display the results
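
The same steps can be sketched outside BigQuery in plain JavaScript (the language BigQuery UDFs use). This only illustrates the replace-then-split idea and is not the query itself; the helper name `extractNumbers` is hypothetical.

```javascript
// Hypothetical helper mirroring the replace-then-split idea.
function extractNumbers(str) {
  return str
    .replace(/[^\d]+/g, ',')   // every run of non-digits becomes a single comma
    .split(',')                // split the string on those commas
    .filter(function (part) {  // drop the empty pieces left at the edges
      return part !== '';
    });
}

console.log(extractNumbers('this is a string with some 666 numbers 999 in it 333'));
```

The values stay strings here, just as in the SQL version, so they would still need a cast before any numeric comparison.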

1 Comment

Note that it fails for floats or negative numbers ;), but it can easily be improved.
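
One way the pattern might be extended, sketched in plain JavaScript: allow an optional leading minus sign and an optional fractional part. The helper name `extractSignedNumbers` is hypothetical. There is a trade-off for URLs: a hyphen used as a word separator, as in 'another-8766', is then read as a minus sign.

```javascript
// Hypothetical helper: matches integers and decimals with an optional minus sign.
function extractSignedNumbers(str) {
  // -?           optional minus sign
  // \d+          integer part
  // (?:\.\d+)?   optional fractional part
  var matches = str.match(/-?\d+(?:\.\d+)?/g);
  // match() returns null when nothing matches, so fall back to an empty array
  return (matches || []).map(Number);
}

console.log(extractSignedNumbers('another -2 example 99.5'));
```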

As you use regex in BigQuery more and more, you will realize that its implementation is quite limited as of now
BigQuery Regular expression functions
re2 Syntax

So most likely you will soon have to do something like the query below.
Please note: for your specific example, the code below has absolutely no benefit over the simple solution provided by @Cybril.
This solution is more for your potential needs in the near future.
It uses a JavaScript UDF, which gives you the power of JavaScript's regexp implementation.
BigQuery User-Defined Functions

SELECT mystr, MAX(number) as max_number FROM JS(
  // input table
  (SELECT mystr FROM
    (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
    (SELECT 'just one number 123 in this one ' AS mystr),
    (SELECT '99' AS mystr),
    (SELECT 'another -2 example 99' AS mystr),
    (SELECT 'another-8766 example 99' AS mystr),
    (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
    (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
    (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)
  ) ,
  // input columns
    mystr,
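  // output schema (number is an integer so that MAX() compares numerically, not as text)
  "[
  {name: 'mystr', type: 'string'},
  {name: 'number', type: 'integer'}
  ]",
  // function
  "function(r, emit){
    // match() returns null when there are no digits, so fall back to an empty array
    var numbers = r.mystr.match(/(\d+)/g) || [];
    for (var i=0; i < numbers.length; i++) {
      emit({
        mystr: r.mystr,
        number: parseInt(numbers[i], 10)
      });
    }
  }"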
)
GROUP BY 1

Of course, you can also move the logic for determining the max value inside the UDF to eliminate the extra grouping.
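
A minimal sketch of that idea, written as a standalone JavaScript function with the same `(r, emit)` shape as the UDF above so it can be tried outside BigQuery. The name `emitMaxNumber` and the `max_number` output field are assumptions for illustration; inside `JS(...)` the output schema would need a matching integer field.

```javascript
// Hypothetical variant of the UDF body: emits one row per input, carrying the largest number.
function emitMaxNumber(r, emit) {
  // match() returns null when there are no digits, so fall back to an empty array
  var numbers = r.mystr.match(/\d+/g) || [];
  if (numbers.length === 0) {
    return; // emit nothing for rows without any digits
  }
  var values = numbers.map(function (n) {
    return parseInt(n, 10); // compare as integers, not as text
  });
  emit({
    mystr: r.mystr,
    max_number: Math.max.apply(null, values)
  });
}

// Local usage with a stand-in for BigQuery's emit callback
var rows = [];
emitMaxNumber(
  { mystr: 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' },
  function (row) { rows.push(row); }
);
console.log(rows);
```

With this shape the outer query would no longer need MAX or GROUP BY, since each input row yields at most one output row.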

1 Comment

Thanks - I was thinking I might need to go with the UDF approach, but I'm still only learning in that regard. Thanks for sharing. I think some of the BQ team mentioned they are working on a catalog of functions that they hope to release at some stage, so I'm looking forward to that also.
