0

i have table like below :

|-------------|---------------------------------------------------|
|ID.          | CONTENT                                           |
|-------------|---------------------------------------------------|
|1            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|2            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|3            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content. <SUB-TITLE-3> Content
|-------------|---------------------------------------------------|

I want to extract all text in between <>, so it will become like below :

|-------------|-------------------------------------------------|
|ID.          | CONTENT                                         |
|-------------|-------------------------------------------------|
|1            |TITLE                                            |
|1            |SUB-TITLE-1                                      |
|1            |SUB-TITLE-2                                      |
|2            |TITLE                                            |
|2            |SUB-TITLE-1                                      |
|2            |SUB-TITLE-2                                      |
|3            |TITLE                                            |
|3            |SUB-TITLE-1                                      |
|3            |SUB-TITLE-2                                      | 
|3            |SUB-TITLE-3                                      |
|-------------|-------------------------------------------------|

How to achieve this ? I'm trying to do by regex, but I think I'm lost.. My Oracle version is 18c, if that's help...

4 Answers 4

3

You can use the 4th argument of REGEXP_SUBSTR to specify an occurrence for matching.

To get a row for the 1st, 2nd, and 3rd occurrence, you can cross-join with a sub-query from dual.

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, s.match_occurrence, 'i', 1) AS content_match
FROM test_data t
CROSS JOIN (
    SELECT 1 AS match_occurrence FROM dual UNION
    SELECT 2 AS match_occurrence FROM dual UNION
    SELECT 3 AS match_occurrence FROM dual UNION
    SELECT 4 AS match_occurrence FROM dual
    /* ... etc, with the number of rows equal to the maximum number of matches that can appear */
) s
WHERE REGEXP_SUBSTR(t.content_data, '<.*?>', 1, s.match_occurrence) IS NOT NULL /* Only return records that have a match for the given occurrence */
ORDER BY t.content_id, s.match_occurrence

Borrowing the CONNECT_BY_LEVEL from Barbaros' excellent answer, you could do it more concisely as:

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, LEVEL, 'i', 1) AS content_match
FROM test_data t
CONNECT BY 
   LEVEL <= REGEXP_COUNT(t.content_data, '<.*?>')
   AND PRIOR sys_guid() IS NOT NULL
   AND PRIOR content_id = content_id
ORDER BY t.content_id, LEVEL

Note that the CONNECT_BY_LEVEL method might be slower on large datasets, so I would avoid that if performance is a concern.

Sign up to request clarification or add additional context in comments.

2 Comments

thanks Josh, i forgot to tell that the subtitle not always come with 2 sub-title, there might be case for 3 or 4 sub-title, does it mean i need to multiply the match_occurrence ?
@GerryGry Sure, I updated the answer. You can just add more rows until you've got as many as the maximum number of matches you can have. If you want to not return rows without a match, you can filter for that by making sure the match IS NOT NULL.
3

One option would be using instr() and substr() functions together within a

SELECT .. FROM ..CONNECT BY level style query in order to repeat through counting the numbers of > (or <) signs within each strings :

SELECT id, substr(content,
                  instr(content,'<',1,level)+1,
                  instr(content,'>',1,level)-instr(content,'<',1,level)-1) as content
  FROM tab
CONNECT BY level <= regexp_count(content,'>')
   AND PRIOR sys_guid() IS NOT NULL
   AND PRIOR id = id

Demo

3 Comments

Good answer Barbaros I was trying to do the same your answer is more apt than mine could you please take a sec and just recommend few modifications in my answer
hm I can tell you that you don't need to count the inequality signs through number of queries, thank you @HimanshuAhuja .
Hope you don't mind, but I borrowed your connect by level usage to make my answer a bit more concise. Let me know if you'd like me to take that section out!
0

I had tried the more conventional way using SUBSTR and INSTR

     With data as 
      Select column1, 
      Trim(CONTENT, '<', '>') as col2 FROM
         TABLE
    WITH subdata as
   ( Select column1, 
     SUBSTR(col2,0, INSTR(col2, ' ')) 
    as s1
     from 
     data) t1
     Union
     ( Select t1.column1 as col1,
     SUBSTR(col2, Length(t1.s1)+1
           INSTR(
           SUBSTR(
            t1.col2, Length(t1.s1)+1,  
           LENGTH(col2)), ' ')))  as col2
      From
     data) t3
      Union
       ........ t3.... t4

      From table

Comments

0

Perfect approach, @Josh Eller !

Only that @Gerry Gry needs it without the greater/smaller signs.

Try with grouping using parentheses?

WITH test_data(content_id,content_data) AS (
      SELECT 1 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<SUB-TITLE-3>Content.' FROM dual
UNION SELECT 2 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' FROM dual
)

SELECT t.content_id
     , match_occurrence
     , REGEXP_SUBSTR(
         t.content_data     -- input string
       , '[<]([^>]*)[>]'    -- regex
       , 1                  -- starting position
       , s.match_occurrence -- n-th occurrence
       , ''                 -- regexp modifier
       , 1                  -- captured-subexp
       ) AS content_match
FROM test_data t
CROSS JOIN (
      SELECT 1 FROM dual 
UNION SELECT 2 FROM dual 
UNION SELECT 3 FROM dual 
UNION SELECT 4 FROM dual
) s(match_occurrence)
WHERE 
  REGEXP_SUBSTR(
         t.content_data     -- input string
       , '[<]([^>]*)[>]'    -- regex
       , 1                  -- starting position
       , s.match_occurrence -- n-th occurrence
       , ''                 -- regexp modifier
       , 1                  -- captured-subexp
  )  
  IS NOT NULL
ORDER BY t.content_id, s.match_occurrence
;
-- out Time: First fetch (0 rows): 0.656 ms. All rows formatted: 0.667 ms
-- out  content_id | match_occurrence | content_match 
-- out ------------+------------------+---------------
-- out           1 |                1 | TITLE
-- out           1 |                2 | SUB-TITLE-1
-- out           1 |                3 | SUB-TITLE-2
-- out           1 |                4 | SUB-TITLE-3
-- out           2 |                1 | TITLE
-- out           2 |                2 | SUB-TITLE-1
-- out           2 |                3 | SUB-TITLE-2
-- out (7 rows)
-- out 
-- out Time: First fetch (7 rows): 29.904 ms. All rows formatted: 29.947 ms

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.