
I want to create a table in a DuckDB database from a MongoDB collection in Python, for further analytics. Currently I do the following:

  • dump mongo collection as jsonl to disk (single file)
  • open duckdb connection and load jsonl file into a table
with open("mongo_json.jsonl", "w") as file:
    for doc in mongo_cursor:
        file.write(json.dumps(doc, default=str) + "\n")

duckdb.sql("CREATE OR REPLACE TABLE mongo_table AS SELECT * FROM read_json_auto('mongo_json.jsonl', ignore_errors=true)")

But the JSON is really big, which increases memory consumption. Are there any ideas or a better approach to achieve this?

  • github.com/duckdb/duckdb/issues/11684 was a recent question about read_json_auto memory usage, and the recommendation was to use ndjson/jsonl — which is what you're doing here. Are you saying your approach works and the problem is just large memory usage? Commented Apr 17, 2024 at 16:50
  • Yes, exactly the problem is in duckdb. Any idea how to fix this issue? Commented Apr 17, 2024 at 20:07
  • You can try changing the default memory limits: duckdb.org/docs/configuration/overview.html#examples Other than that - you may need to talk to the duckdb people if you don't get any further feedback here. Commented Apr 17, 2024 at 22:22

1 Answer


If your data can fit into memory, check out pymongoarrow (link). You can use it to fetch Arrow tables from MongoDB, which can easily be ingested into DuckDB. You might even be able to do this in chunks to avoid going OOM.


1 Comment

Thank you for the info, but the problem is that I do not know the schema, and the data inside MongoDB is inconsistent.

