1

I need to extract information from a malformed JSON string. Due to an error, all quotes have been stripped, so instead of:

{"receipt_type":"Production","adam_id":1233,"app_item_id":1233,"more":[{"h":234}]}

I have:

 {receipt_type:Production,adam_id:1233,app_item_id:1233,more:[{h:234}]}

Using

replace(replace(replace(replace(replace(replace("transactionReceipt", ':', '":"'),',','","'),'{','{"'),'}','"}'),':"[{',':[{'), ']"}', ']}')

I can get mostly valid JSON. HOWEVER, this incorrectly turns "receipt_creation_date":"2021-01-24 03:11:53 Etc/GMT" into "receipt_creation_date":"2021-01-24 03":"11":"53 Etc/GMT"

Using

REGEXP_REPLACE(replace(replace(replace(replace(replace(replace("transactionReceipt", ':', '":"'),',','","'),'{','{"'),'}','"}'),':"[{',':[{'), ']"}', ']}'),'\d\d\"\:\"\d\d\"\:\"\d\d','','g')

I can match the 03":"11":"53 and remove it totally, yielding correct JSON.

But I want to retain the time. How can I replace ":" with : ONLY where it occurs inside a match of the pattern '\d\d\"\:\"\d\d\"\:\"\d\d'?

Thank you!

edit: a complete malformed record (it's actually an IAP transaction receipt from Apple) looks like

{receipt_type:Production,adam_id:123456789,app_item_id:123456789,bundle_id:example.app.yo,application_version:101,download_id:123456789,version_external_identifier:123456789,receipt_creation_date:2020-03-19 16:18:27 Etc/GMT,receipt_creation_date_ms:1584634707000,receipt_creation_date_pst:2020-03-19 09:18:27 America/Los_Angeles,request_date:2020-03-19 16:18:28 Etc/GMT,request_date_ms:1584634708876,request_date_pst:2020-03-19 09:18:28 America/Los_Angeles,original_purchase_date:2020-02-28 17:55:33 Etc/GMT,original_purchase_date_ms:1582912533000,original_purchase_date_pst:2020-02-28 09:55:33 America/Los_Angeles,original_application_version:101,in_app:[{quantity:1,product_id:com.eample.iap.id,transaction_id:123456789,original_transaction_id:123456789,purchase_date:2020-03-19 16:18:27 Etc/GMT,purchase_date_ms:1584634707000,purchase_date_pst:2020-03-19 09:18:27 America/Los_Angeles,original_purchase_date:2020-03-19 16:18:27 Etc/GMT,original_purchase_date_ms:1584634707000,original_purchase_date_pst:2020-03-19 09:18:27 America/Los_Angeles,is_trial_period:false}]}

  • Geez. Why can't Apple simply follow established standards and provide that as a valid JSON structure? Commented Jan 27, 2021 at 10:27

3 Answers

1

I would use two nested regexp_replace() calls:

  • first to enclose all keys with double quotes
  • then to enclose all values with double quotes
with data (input) as (
  values ('{receipt_type:Production,adam_id:1233,app_item_id:1233,more:[{h:234}]}')
)
select regexp_replace(
          regexp_replace(input, '(\w+)(:)', '"\1"\2', 'g'),
          '(:)(\w+)(,|})', '\1"\2"\3', 'g')
from data;          

returns:

{"receipt_type":"Production","adam_id":"1233","app_item_id":"1233","more":[{"h":"234"}]}

If you want to have the integers without quotes, you could run a third regexp_replace that strips the quotes from quoted integers:

regexp_replace(..., '"([0-9]+)"', '\1', 'g')
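
A possible variation, offered as a sketch rather than as part of the approach above: anchor the key pattern on the preceding `{` or `,` so that colons inside a value are never treated as key separators, and let each value run up to the next `,`, `}`, `[` or `]`. This assumes that no value contains a comma, brace, or bracket, which holds for the sample receipt in the question but is not guaranteed in general. The input string below is a shortened, made-up record that reuses field names from the question.

```sql
-- Sketch only: assumes values never contain ',', '{', '}', '[' or ']'.
with data (input) as (
  values ('{receipt_type:Production,receipt_creation_date:2020-03-19 16:18:27 Etc/GMT,bundle_id:example.app.yo,in_app:[{quantity:1,is_trial_period:false}]}')
)
select regexp_replace(
         regexp_replace(
           -- 1. quote keys: a word that directly follows '{' or ',' and ends in ':'
           regexp_replace(input, '([{,])(\w+):', '\1"\2":', 'g'),
           -- 2. quote scalar values: everything after '":' up to the next , { } [ ]
           --    ([^][,{}] is the POSIX way to put ']' and '[' into a negated class)
           '":([^][,{}]+)', '":"\1"', 'g'),
         -- 3. remove the quotes again from values that are pure integers
         '":"(\d+)"', '":\1', 'g')::json   -- the cast fails if the result is not valid JSON
from data;
-- returns:
-- {"receipt_type":"Production","receipt_creation_date":"2020-03-19 16:18:27 Etc/GMT","bundle_id":"example.app.yo","in_app":[{"quantity":1,"is_trial_period":"false"}]}
```

Note that `false` stays a quoted string in this sketch; unquoting booleans would need one more replace.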
  • Hi, thank you! I wrapped the three replaces, but it results in this:{"receipt_creation_date":2020-03-19 16:18:27 Etc/GMT,"receipt_creation_date_ms":1584634707000,...} so the date/time string doesn't get quotes for receipt_creation_date :( Commented Jan 26, 2021 at 16:47
  • @phikappa: you probably need a first step that only applies to patterns for timestamps Commented Jan 26, 2021 at 17:07
  • @a_horse_with_no_name - I added an extension (not sure what to call it) to your answer in a revised one of my own - if you have any feedback, that would be great! Commented Jan 27, 2021 at 16:55
0

The answer by @a_horse_with_no_name got me onto the right path. I solved it by changing my previous attempt

REGEXP_REPLACE(replace(replace(replace(replace(replace(replace("transactionReceipt", ':', '":"'),',','","'),'{','{"'),'}','"}'),':"[{',':[{'), ']"}', ']}'),'\d\d\"\:\"\d\d\"\:\"\d\d','','g')

to this:

REGEXP_REPLACE(replace(replace(replace(replace(replace(replace("transactionReceipt", ':', '":"'),',','","'),'{','{"'),'}','"}'),':"[{',':[{'), ']"}', ']}'),'(\d\d)(":")(\d\d)(":")(\d\d)','\1:\3:\5','g')

Note the parentheses enclosing the (\d\d) and (":") blocks, which let me backreference them in the replacement string. This gave correct JSON, albeit with quotes around integers. I then wrapped everything in the third component, regexp_replace(..., '"([0-9]+)"', '\1', 'g'), to yield perfect JSON.
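
The backreference swap can be tried in isolation on a literal string. This is a minimal sketch that uses the over-quoted fragment from the question as input; only the digit pairs are captured here, since the `":"` separators are never referenced in the replacement.

```sql
-- Input is the broken fragment quoted in the question.
select regexp_replace(
         '"receipt_creation_date":"2021-01-24 03":"11":"53 Etc/GMT"',
         '(\d\d)":"(\d\d)":"(\d\d)',   -- capture only the three digit pairs
         '\1:\2:\3',                    -- re-join them with plain colons
         'g');
-- returns: "receipt_creation_date":"2021-01-24 03:11:53 Etc/GMT"
```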

  • Maybe you should give us the data (a few records) and we can possibly provide a regex that tackles the underlying problem - all of those REPLACE(REPLACE(...s hurt my eyes! :-) @a_horse_with_no_name's answer was pretty good - see here. p.s. welcome to the forum! Commented Jan 26, 2021 at 18:54
  • Thank you @Vérace ! I know all the replaces are a bit cumbersome, the issue with the first part of @a_horse_with_no_name was that it leaves some values without quotes, like so "request_date":2020-03-19 16:18:28 Etc/GMT but yeah not super elegant. I added an example record below Commented Jan 27, 2021 at 9:20
  • Where is the sample record that you "added below"? Commented Jan 27, 2021 at 9:23
  • actually edited my original question with the example Commented Jan 27, 2021 at 9:24
0

I have extensively revised my answer (and deleted the old one!) - it's quite simple really - just requires a lot of patience and f&*#-ing (that's fiddling) about - regexes really are fussy!

@a_horse_with_no_name's answer gives you this (see the fiddle here):

SELECT SUBSTR( regexp_replace(
          regexp_replace(j_str, '(\w+)(:)', '"\1"\2', 'g'),
          '(:)(\w+)(,|})', '\1"\2"\3', 'g'), 198)
from test;

Result:

.... 9","receipt_creation_date":2020-03-19 "16":"18":27 Etc/GMT,"rec...

SUBSTR is for testing.

So, the challenge is to convert

...te":2020-03-19 "16":"18":27 Etc/GMT,"r...

into

...te":"2020-03-19 16:18:27 Etc/GMT","r...

which I did in the following way:

SELECT
SUBSTRING(
  REGEXP_REPLACE
  (
    REGEXP_REPLACE
    (
      REGEXP_REPLACE
      (
        j_str, 
        '(\w+)(:)', 
        '"\1"\2', 'g'
      ),
      '(:)(\w+)(,|})', '\1"\2"\3', 'g'
    ),
    '(":)(\d{4}-\d{2}-\d{2}) "(\d{2})":"(\d{2})":(\d{2}) (\w+/\w+),"',
    '\1"\2 \3:\4:\5 \6","', 'g' 
  ), 200
)
FROM test;

Result (line breaks inserted for legibility):

,"receipt_creation_date":"2020-03-19 16:18:27 Etc/GMT",

"receipt_creation_date_ms":"1584634707000",

"receipt_creation_date_pst":"2020-03-19 09:18:27 America/Los_Angeles",
"request_date":"2020-03-19 16:18:28 Etc/GMT",

"request_date_ms":"1584634708876","request_date_pst":"2020-03-19 09:18:28 America/Los_Angeles",

"original_purchase_date":"2020-02-28 17:55:33 Etc/GMT",

"original_purchase_date_ms":"1582912533000",

"original_purchase_date_pst":"2020-02-28 09:55:33 America/Los_Angeles",

"original_application_version":"101","in_app":[{"quantity":"1","product_id":com.eample.iap.id,"transaction_id":"123456789","original_transaction_id":"123456789",

"purchase_date":"2020-03-19 16:18:27 Etc/GMT",

"purchase_date_ms":"1584634707000","purchase_date_pst":"2020-03-19 09:18:27 America/Los_Angeles",

"original_purchase_date":"2020-03-19 16:18:27 Etc/GMT",

"original_purchase_date_ms":"1584634707000",

"original_purchase_date_pst":"2020-03-19 09:18:27 America/Los_Angeles","is_trial_period":"false"}]}

Every date appears to be in the correct format! As mentioned in my deleted answer, regexes can become quite complex quite soon - for a look at what a regex for ISO-8601 dates looks like, go here - make yourself a coffee first!
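
To check whether a repaired string really is valid JSON, rather than inspecting it by eye, one option is to attempt a cast and trap the error. The helper below is a hypothetical sketch (the name `is_valid_json` is an assumption, not a built-in function); values that contain dots, such as the product_id in the output above, are still unquoted, so a check like this would flag that record.

```sql
-- Hypothetical helper: returns true if the text can be cast to json.
create or replace function is_valid_json(p_text text)
returns boolean
language plpgsql
as $$
begin
  perform p_text::json;   -- raises an error on malformed input
  return true;
exception when others then
  return false;           -- any cast failure is treated as "not valid"
end;
$$;

select is_valid_json('{"quantity":"1"}')                  as ok,
       is_valid_json('{"product_id":com.eample.iap.id}')  as broken;
-- ok = true, broken = false
```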
