I am trying to load a bunch of JSON files into MarkLogic 8 using MLCP and a basic transform script on ingest.

I can load the files as-is, and I get JSON objects in ML.

What I want is to transform the documents from JSON to XML on ingestion, so I wrote a basic transform like so:

xquery version "1.0-ml";

module namespace ingest = "http://dikw.com/ingest/linkedin";

import module namespace json="http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";
import module namespace sem="http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";

declare namespace basic="http://marklogic.com/xdmp/json/basic";

declare default function namespace "http://www.w3.org/2005/xpath-functions";

declare option xdmp:mapping "false";

declare function ingest:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  let $org-doc := map:get($content, "value")
  let $jsonxml := json:transform-from-json($org-doc)
  let $name := $jsonxml//basic:full__name
  let $_ := xdmp:log(concat('Inserting linkedin profile ', $name, '.xml..'))
  let $new-doc := 
    document {
      <json>{
        $jsonxml
      }</json>
    }
  return (
    map:put($content, "value", $new-doc),
    $content
  )
};

If I use MLCP to load the docs without the transform, it works, but as stated above I get JSON inside ML8. (I use Roxy to invoke MLCP against the right environment.)

./ml $ENV mlcp import -input_file_path content/linkedin -input_file_type documents  

The above works ok.

But using the transform like so:

./ml $ENV mlcp import -input_file_path content/linkedin -input_file_type documents  -transform_module /ingest/linkedin.xqy -output_collections incoming,incoming/linkedin

I get an error: "ERROR contentpump.MultithreadedMapper: Unknown content type: json"

15/06/22 17:37:12 INFO contentpump.ContentPump: Hadoop library version: 2.0.0-mr1-cdh4.3.0
15/06/22 17:37:12 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
15/06/22 17:37:12 WARN util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
15/06/22 17:37:12 INFO input.FileInputFormat: Total input paths to process : 9
15/06/22 17:37:13 ERROR contentpump.MultithreadedMapper: Unknown content type: json
java.lang.IllegalArgumentException: Unknown content type: json
    at com.marklogic.mapreduce.ContentType.forName(ContentType.java:107)
    at com.marklogic.contentpump.utilities.TransformHelper.getTransformInsertQry(TransformHelper.java:124)
    at com.marklogic.contentpump.TransformWriter.write(TransformWriter.java:97)
    at com.marklogic.contentpump.TransformWriter.write(TransformWriter.java:46)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
    at com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:46)
    at com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:32)
    at com.marklogic.contentpump.BaseMapper.runThreadSafe(BaseMapper.java:51)
    at com.marklogic.contentpump.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:376)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

In Query Console things work as expected: a JSON variable is transformed into an XML document.

What am I missing here?

tx

hugo

  • In addition to the answers below: are you using the latest MLCP? Commented Jun 22, 2015 at 18:55
  • @grtjn : I run it from latest roxy on github ./ml hugo mlcp -v java -cp "/usr/local/mlcp/lib/hadoop-auth-2.0.0-alpha.jar...lot-of-stuff-here..." com.marklogic.contentpump.ContentPump Commented Jun 23, 2015 at 8:48
  • If you don't recall which version of MLCP you installed yourself under /usr/local/mlcp/, then we might be able to deduce that from the xcc jar version. Does it show with -v? Commented Jun 23, 2015 at 10:47

1 Answer

As per https://docs.marklogic.com/guide/ingestion/content-pump#id_82518 there seem to be a few things missing.

You are not specifying a document type to store (-document_type xml). You are storing only XML, but using "documents" as the input type (assuming these files have a .json extension?), so the code doesn't know that the transform is converting from JSON to XML.

You are not changing the URI, so the default MIME mappings cannot tell that your input and output types are expected to differ:

https://docs.marklogic.com/guide/ingestion/content-pump#id_17589

No matter what suffix you use, it won't work for JSON input and XML storage without the additional information supplied (see the links above).

1 Comment

indeed when adding -document_type xml -transform_namespace dikw.com/ingest/linkedin it works... bit tricky that you have to specify in mlcp what you will do in the transform.xqy ... thx
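
The fix confirmed in the comment above can be sketched as a dry run that only prints the command. The flags -document_type and -transform_namespace come from the comment; everything else is copied from the failing command in the question. Two assumptions are mine and not from the thread: ROXY_ENV is a hypothetical stand-in for the asker's $ENV, with "local" as a placeholder environment name, and the namespace is written out in full so that it matches the module namespace declared in the transform module.

```shell
#!/bin/sh
# Dry run: build the MLCP import command and print it instead of running it.

# Assumption: ROXY_ENV is a hypothetical stand-in for the asker's $ENV;
# "local" is only a placeholder Roxy environment name.
ROXY_ENV="${ROXY_ENV:-local}"

# Assumption: this must match the "module namespace" declaration
# in /ingest/linkedin.xqy.
TRANSFORM_NS="http://dikw.com/ingest/linkedin"

# -document_type xml tells MLCP that the transform output is stored as XML,
# even though the input files are JSON.
CMD="./ml $ROXY_ENV mlcp import \
-input_file_path content/linkedin \
-input_file_type documents \
-document_type xml \
-transform_module /ingest/linkedin.xqy \
-transform_namespace $TRANSFORM_NS \
-output_collections incoming,incoming/linkedin"

echo "$CMD"
# To run the import for real, replace the echo with: eval "$CMD"
```

Printing the command first makes it easy to check the flags before touching the database. No -transform_function is given here, on the understanding that MLCP looks for a function named transform by default, which is what the module in the question defines.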
