1

I have a simple function to convert html blob to plain text

 FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
 IS v_return CLOB;
   BEGIN     
select utl_i18n.unescape_reference(regexp_replace(html, '<.+?>', ' ')) INTO v_return from dual;
return (v_return);
    END;

called in that way:

SELECT A, B, C, HTML_TO_TEXT(BLobField) FROM t1

all works fine until BlobFields contains more than 4000 character, then i got

ORA-01704: string literal too long
01704. 00000 -  "string literal too long"
*Cause:    The string literal is longer than 4000 characters.
*Action:   Use a string literal of at most 4000 characters.
           Longer values may only be entered using bind variables.

i try to avoud string inside function using variables but nothing changes:

FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
 IS v_return CLOB;
 "stringa" CLOB;
   BEGIN 
    SELECT regexp_replace(html, '<.+?>', ' ') INTO "stringa" FROM DUAL;
    select utl_i18n.unescape_reference("stringa") INTO v_return from dual;
    return (v_return);
END;
2
  • 1
    The unescape_reference function take a varchar2 argument, not a CLOB. Does your HTML include line breaks with each line shorter than 4000 bytes - in which case you could chop it into individual lines, unescape those, and stick back together as a CLOB? Commented Jun 21, 2022 at 13:04
  • i can have spaces in html or maybe some <br> Commented Jun 21, 2022 at 13:08

1 Answer 1

1

Do not use regular expressions to parse HTML. If you want to extract the text then use an XML parser:

SELECT a,
       b,
       c,
       UTL_I18N.UNESCAPE_REFERENCE(
         XMLQUERY(
           '//text()'
           PASSING XMLTYPE(blobfield, 1)
           RETURNING CONTENT
         ).getStringVal()
       ) AS text
FROM   t1

Which will work where the extracted text is 4000 characters or less (since XMLTYPE.getStringVal() will return a VARCHAR2 data type and UTL_I18N.UNESCAPE_REFERENCE accepts a VARCHAR2 argument).


If you want to get it to work on CLOB values then you can still use XMLQUERY and getClobVal() but UTL_I18N.UNESCAPE_REFERENCE still only works on VARCHAR2 input (and not CLOBs) so you will need to split the CLOB into segments and parse those and concatenate them once you are done.

Something like:

CREATE FUNCTION html_to_text(
  i_xml IN XMLTYPE
) RETURN CLOB
IS
  v_text   CLOB;
  v_output CLOB;
  str      VARCHAR2(4000);
  len      PLS_INTEGER;
  pos      PLS_INTEGER := 1;
  lim      CONSTANT PLS_INTEGER := 4000;
BEGIN
  SELECT XMLQUERY(
           '//text()'
           PASSING i_xml
           RETURNING CONTENT
         ).getStringVal()
  INTO   v_text
  FROM   DUAL;
  
  len := LENGTH(v_text);
  WHILE pos <= len LOOP
    str := DBMS_LOB.SUBSTR(v_text, lim, pos);
    v_output := v_output || UTL_I18N.UNESCAPE_REFERENCE(str);
    pos := pos + lim;
  END LOOP;
  
  RETURN v_output;
END;
/

However, you probably want to make it more robust and check if you are going to split the string in the middle of an escaped XML character.

db<>fiddle here

Sign up to request clarification or add additional context in comments.

4 Comments

found stackoverflow.com/questions/53914336/… to split but give me some error: link stringBuffer undeclared and then "numeric or value error)
got ORA-06553: PLS-307: too many declarations of 'XMLTYPE' match this call 06553. 00000 - "PLS-%s: %s". it's because i have clob while you use blob field for table
found XMLTYPE.createXML but not working on my text since is not correct xml: may have only plain text without any tag
@gt.guybrush Then add a tag to make it valid XML: '<root>' || :your_string || '</root>'

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.