Oracle SQL: Extracting multiple text between two characters

Question

i have table like below :

|-------------|---------------------------------------------------|
|ID.          | CONTENT                                           |
|-------------|---------------------------------------------------|
|1            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|2            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|3            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content. <SUB-TITLE-3> Content
|-------------|---------------------------------------------------|

I want to extract all text in between <>, so it will become like below :

|-------------|-------------------------------------------------|
|ID.          | CONTENT                                         |
|-------------|-------------------------------------------------|
|1            |TITLE                                            |
|1            |SUB-TITLE-1                                      |
|1            |SUB-TITLE-2                                      |
|2            |TITLE                                            |
|2            |SUB-TITLE-1                                      |
|2            |SUB-TITLE-2                                      |
|3            |TITLE                                            |
|3            |SUB-TITLE-1                                      |
|3            |SUB-TITLE-2                                      | 
|3            |SUB-TITLE-3                                      |
|-------------|-------------------------------------------------|

How to achieve this ? I'm trying to do by regex, but I think I'm lost.. My Oracle version is 18c, if that's help...

Josh Eller · Accepted Answer · 2020-05-18 16:11:32Z

3

You can use the 4th argument of REGEXP_SUBSTR to specify an occurrence for matching.

To get a row for the 1st, 2nd, and 3rd occurrence, you can cross-join with a sub-query from dual.

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, s.match_occurrence, 'i', 1) AS content_match
FROM test_data t
CROSS JOIN (
    SELECT 1 AS match_occurrence FROM dual UNION
    SELECT 2 AS match_occurrence FROM dual UNION
    SELECT 3 AS match_occurrence FROM dual UNION
    SELECT 4 AS match_occurrence FROM dual
    /* ... etc, with the number of rows equal to the maximum number of matches that can appear */
) s
WHERE REGEXP_SUBSTR(t.content_data, '<.*?>', 1, s.match_occurrence) IS NOT NULL /* Only return records that have a match for the given occurrence */
ORDER BY t.content_id, s.match_occurrence

Borrowing the CONNECT_BY_LEVEL from Barbaros' excellent answer, you could do it more concisely as:

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, LEVEL, 'i', 1) AS content_match
FROM test_data t
CONNECT BY 
   LEVEL <= REGEXP_COUNT(t.content_data, '<.*?>')
   AND PRIOR sys_guid() IS NOT NULL
   AND PRIOR content_id = content_id
ORDER BY t.content_id, LEVEL

Note that the CONNECT_BY_LEVEL method might be slower on large datasets, so I would avoid that if performance is a concern.

edited May 18, 2020 at 16:11

answered May 18, 2020 at 14:51

Josh Eller

2,0859 silver badges11 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Gerry Gry Over a year ago

thanks Josh, i forgot to tell that the subtitle not always come with 2 sub-title, there might be case for 3 or 4 sub-title, does it mean i need to multiply the match_occurrence ?

Josh Eller Over a year ago

@GerryGry Sure, I updated the answer. You can just add more rows until you've got as many as the maximum number of matches you can have. If you want to not return rows without a match, you can filter for that by making sure the match IS NOT NULL.

Barbaros Özhan · Accepted Answer · 2020-05-18 15:24:02Z

3

One option would be using instr() and substr() functions together within a

SELECT .. FROM ..CONNECT BY level style query in order to repeat through counting the numbers of > (or <) signs within each strings :

SELECT id, substr(content,
                  instr(content,'<',1,level)+1,
                  instr(content,'>',1,level)-instr(content,'<',1,level)-1) as content
  FROM tab
CONNECT BY level <= regexp_count(content,'>')
   AND PRIOR sys_guid() IS NOT NULL
   AND PRIOR id = id

Demo

answered May 18, 2020 at 15:24

Barbaros Özhan

65.9k11 gold badges36 silver badges64 bronze badges

3 Comments

Himanshu Over a year ago

Good answer Barbaros I was trying to do the same your answer is more apt than mine could you please take a sec and just recommend few modifications in my answer

Barbaros Özhan Over a year ago

hm I can tell you that you don't need to count the inequality signs through number of queries, thank you @HimanshuAhuja .

Josh Eller Over a year ago

Hope you don't mind, but I borrowed your connect by level usage to make my answer a bit more concise. Let me know if you'd like me to take that section out!

Himanshu · Accepted Answer · 2020-05-18 15:17:09Z

0

I had tried the more conventional way using SUBSTR and INSTR

     With data as 
      Select column1, 
      Trim(CONTENT, '<', '>') as col2 FROM
         TABLE
    WITH subdata as
   ( Select column1, 
     SUBSTR(col2,0, INSTR(col2, ' ')) 
    as s1
     from 
     data) t1
     Union
     ( Select t1.column1 as col1,
     SUBSTR(col2, Length(t1.s1)+1
           INSTR(
           SUBSTR(
            t1.col2, Length(t1.s1)+1,  
           LENGTH(col2)), ' ')))  as col2
      From
     data) t3
      Union
       ........ t3.... t4

      From table

answered May 18, 2020 at 15:17

Himanshu

3,9802 gold badges16 silver badges34 bronze badges

Comments

marcothesane · Accepted Answer · 2020-05-18 16:00:19Z

Perfect approach, @Josh Eller !

Only that @Gerry Gry needs it without the greater/smaller signs.

Try with grouping using parentheses?

WITH test_data(content_id,content_data) AS (
      SELECT 1 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<SUB-TITLE-3>Content.' FROM dual
UNION SELECT 2 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' FROM dual
)

SELECT t.content_id
     , match_occurrence
     , REGEXP_SUBSTR(
         t.content_data     -- input string
       , '[<]([^>]*)[>]'    -- regex
       , 1                  -- starting position
       , s.match_occurrence -- n-th occurrence
       , ''                 -- regexp modifier
       , 1                  -- captured-subexp
       ) AS content_match
FROM test_data t
CROSS JOIN (
      SELECT 1 FROM dual 
UNION SELECT 2 FROM dual 
UNION SELECT 3 FROM dual 
UNION SELECT 4 FROM dual
) s(match_occurrence)
WHERE 
  REGEXP_SUBSTR(
         t.content_data     -- input string
       , '[<]([^>]*)[>]'    -- regex
       , 1                  -- starting position
       , s.match_occurrence -- n-th occurrence
       , ''                 -- regexp modifier
       , 1                  -- captured-subexp
  )  
  IS NOT NULL
ORDER BY t.content_id, s.match_occurrence
;
-- out Time: First fetch (0 rows): 0.656 ms. All rows formatted: 0.667 ms
-- out  content_id | match_occurrence | content_match 
-- out ------------+------------------+---------------
-- out           1 |                1 | TITLE
-- out           1 |                2 | SUB-TITLE-1
-- out           1 |                3 | SUB-TITLE-2
-- out           1 |                4 | SUB-TITLE-3
-- out           2 |                1 | TITLE
-- out           2 |                2 | SUB-TITLE-1
-- out           2 |                3 | SUB-TITLE-2
-- out (7 rows)
-- out 
-- out Time: First fetch (7 rows): 29.904 ms. All rows formatted: 29.947 ms

Collectives™ on Stack Overflow

Oracle SQL: Extracting multiple text between two characters

4 Answers 4

2 Comments

3 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

3 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related