0

I'm using Oracle REGEXP_REPLACE to remove all non standard ascii (special) characters from a CLOB freetext field a client uses to store data, some users copy and paste from Excel which causes issues when reading the text from the CLOB. I've managed to get rid of all special characters and maintain the layout for paragraphs etc.

The issue I'm having is when I compare the cleansed field to the original, after converting to char string of course, is it is returning rows that have no difference between them, I've pulled out both values and there is no difference in most cases. It's returning the whole dataset instead of just those that have been cleansed.I've run out of ideas

FYI a big part of maintaining the layout means I need to keep the carriage return CHR(13)

Below is the query

select *
from (
select incident, entity,
trim(to_char(textarea)) textarea,
trim(to_char(regexp_replace(textarea,'[^' || CHR(13) || CHR(32) || '-' || CHR(126) || ']',''))) regexp_text
from response
) tab
where tab.regexp_text <> tab.textarea

I've tried a number of different combinations with trimming whitespace and substring smaller strings etc. with no luck

3
  • Can you please post some example data, showing the expected and the actual result? Commented Feb 22, 2016 at 12:10
  • @Aleksej The expected result would be strings that don't match i.e éklzéiuhkn and klziuhkn except it's also returning strings that match as well as ones that don't Commented Feb 22, 2016 at 12:23
  • You could use stackoverflow.com/questions/22533037/… to calculate the MD5 checksum for both result columns, to see if they are really identical. Commented Feb 22, 2016 at 12:51

3 Answers 3

0

Your pattern is: '[^' || CHR(13) || CHR(32) || '-' || CHR(126) || ']'.

The caret (^) is replacing everything that is not a special character with an empty string. With Oracle's default settings, I would expect this to return the empty string, which is treated as a NULL -- but that would mean nothing is returned.

In any case, try removing the caret:

trim(to_char(regexp_replace(textarea, '[CHR(13) || CHR(32) || '-' || CHR(126) || ']',' '))) as regexp_text
Sign up to request clarification or add additional context in comments.

2 Comments

This would not explain the described behaviour, though.
@GordonLinoff When i remove the caret it does the opposite and removes all characters except special
0
WITH data ( value ) AS (
  SELECT 'éklzéiuhkn' FROM DUAL
)
SELECT REGEXP_REPLACE( value, '[^\r -~]', '' )
FROM   data;

Outputs:

REGEXP_R
--------
klziuhkn

So your query should be something like:

SELECT *
FROM   (
  SELECT incident,
         entity,
         TRIM( textarea ) AS textarea,
         TRIM( REGEXP_REPLACE( textarea, '[^\r -~]', '' ) ) AS regexp_text
  FROM   response
)
WHERE textarea <> regexp_text;

2 Comments

I need to maintain the carriage return
@KevinT swapped from \n to \r.
0

I finally figured it out, the issue was in formatting the comparisson strings in the where clause to include all Ascii characters and match against the regexp I planned to use, it was the only way to eliminate bogus and invisible encoded characters.

select incident, entity,
regexp_replace(textarea,'[^' || CHR(13) || ' -}‘-~À-ü]',''))) regexp_text
regexp_replace(textarea,'[^' || CHR(13) || CHR(32) || '-' || CHR(125) || CHR(145) || '-' || CHR(152) || CHR(192) || '-' || CHR(252) || ']','') regexp_text2
from response
where to_char(regexp_replace(textarea,'[^' || CHR(163) || CHR(1) || '-' || CHR(125) || CHR(145) || '-' || CHR(152) || CHR(192) || '-' || CHR(252) || ']','')) <> to_char(regexp_replace(textarea,'[^' || CHR(1) || '-' || CHR(255) || ']',''))

I included both lines to show the simple regexp of '[^' || CHR(13) || ' -}‘-~À-ü]' as well as the CHR(x) version. I also needed to include latin characters in the end.

For some reason using only CHR() when specifing ascii characters works 100% of the time, I guess it might have something to do with the environments NLS_LANG setting.

This should work for all those looking to exclude weird encoded characters from their strings

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.