I saved a bunch of content in MarkLogic as binary format documents instead of XML. When I decode the document, it's XML. The side-effect of this error is that my searches don't include those documents.

Is there a way to convert the format of a document in-situ? If not, is there a way to do some kind of mass conversion? Any other ideas on how I can resolve this?

I know how to list all the URIs for binary documents:

xquery version "1.0-ml";
declare namespace qry  = "http://marklogic.com/cts/query";
let $binary-term :=
  xdmp:plan(/binary())//qry:term-query/qry:key/text()
let $binary_uris := cts:uris((), (), cts:term-query($binary-term))
return $binary_uris

and I know how to decode the documents:

xdmp:binary-decode(fn:doc($uri)/node(), "UTF-8")

but what I don't know is what to do after that. I can loop over that list of $binary_uris and decode them, but how do I take that result and overwrite the existing document in a batch process?

1 Answer

Depending upon how your docs were saved as binary() nodes, you might be able to use xdmp:quote() and then xdmp:unquote().

Below is a quick proof of concept that shows how content that was saved as binary can be turned back into either text or XML:

xquery version "1.0-ml";
xdmp:document-insert("/test.xml", 
  binary{ xs:hexBinary(xs:base64Binary(xdmp:base64-encode(xdmp:quote(<doc>test</doc>))))}),
xdmp:document-insert("/test.txt", 
  binary{ xs:hexBinary(xs:base64Binary(xdmp:base64-encode(xdmp:quote("test" ))))})
;
for $ext in ("xml", "txt")
let $doc := doc("/test." || $ext)
where $doc/node() instance of binary() 
      (: you could also restrict to docs whose URIs end with .xml, .txt, etc. :)
return
  let $doc-text := xdmp:quote($doc)
  let $doc-decoded :=
    if (fn:starts-with($doc-text, "&lt;")) 
    then xdmp:unquote($doc-text)
    else $doc-text 
  return
    $doc-decoded
;
xdmp:document-delete("/test.xml"),
xdmp:document-delete("/test.txt")

If you wanted to "fix" the documents, you could then use xdmp:node-replace() to replace the binary() node with the decoded document:

xdmp:node-replace($doc/node(), $doc-decoded)
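
As a rough sketch of how that could look for a small batch run from Query Console, the following combines the URI lookup and xdmp:binary-decode() call from the question with the node-replace call above. It assumes every binary document found really contains well-formed UTF-8 XML. The limit of 100 is an arbitrary batch size picked for illustration, not a recommended value. Try it on a copy of one document first.

```xquery
xquery version "1.0-ml";
declare namespace qry = "http://marklogic.com/cts/query";

(: Assumption: each binary document holds UTF-8 encoded, well-formed XML. :)
let $binary-term := xdmp:plan(/binary())//qry:term-query/qry:key/text()
(: "limit=100" is a hypothetical batch size that keeps the transaction small. :)
for $uri in cts:uris((), ("limit=100"), cts:term-query($binary-term))
let $node := fn:doc($uri)/node()
where $node instance of binary()
return
  let $text := xdmp:binary-decode($node, "UTF-8")
  return (
    (: Same call shape as above: replace the binary() node with the parsed XML. :)
    xdmp:node-replace($node, xdmp:unquote($text)),
    (: Return the URI so the run reports what it touched. :)
    $uri
  )
```

Because converted documents no longer match the binary term query, re-running the same module should pick up the next batch.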

You could run a batch job, using the MarkLogic Java DMSDK or CORB, to select those docs and re-save them.
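
For the CORB route, the per-document work would live in a process module. A minimal sketch, assuming CORB's usual convention of passing each selected URI in an external $URI variable, and assuming the content is UTF-8. The starts-with check is a crude heuristic borrowed from the proof of concept above, not a real well-formedness test.

```xquery
xquery version "1.0-ml";

(: CORB is assumed to bind each selected URI to this external variable. :)
declare variable $URI as xs:string external;

let $node := fn:doc($URI)/node()
return
  if ($node instance of binary()) then
    let $text := xdmp:binary-decode($node, "UTF-8")
    return
      (: Heuristic: only treat content that starts with "<" as XML. :)
      if (fn:starts-with(fn:normalize-space($text), "&lt;"))
      then xdmp:node-replace($node, xdmp:unquote($text))
      else ()  (: not XML, so leave the binary document untouched :)
  else ()      (: already converted or not binary, so nothing to do :)
```

The URI-selection query from the question could serve as the matching URIs module. CORB expects that module to return the count followed by the URIs.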

1 Comment

I've got the code for the first part (how to find the documents and how to decode them), but I hadn't thought about using the DMSDK and node-replace; thanks, that will help a bunch.