
I want to parse a big JSON file (3 GB) and return a hash map for each line in the file. My intuition was to use a transducer to process the file line by line and construct a vector from a few selected fields (> 5% of the bytes in the file).

However, the following code throws an OutOfMemoryError:

file.json

{"experiments": {"results": ...}}
{"experiments": {"results": ...}}
{"experiments": {"results": ...}}

parser.clj

(require 'cheshire.core)
(require 'clojure.java.io)

(defn load-with!
  "Load a file using a parser, a structure and a transducer."
  [parser structure xform path]
  (with-open [r (clojure.java.io/reader path)]
    (into structure xform (parser r))))

(def xf (map #(get-in % ["experiments" "results"])))
(def parser (comp (partial map cheshire.core/parse-string) line-seq))

(load-with! parser (vector) xf "file.json")

When I monitor the process with JVisualVM, the heap grows over time and exceeds 25 GB before the process crashes.

Are transducers appropriate in this case? Is there a better alternative?

One of my requirements is to return the new structure at the end of the function. Thus, I cannot use doseq to process the file in place.

Moreover, I need to be able to change the parser and the transducer according to the file format.
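For illustration, a CSV file could reuse load-with! unchanged by swapping in a different parser and transducer, as in the sketch below (hypothetical: clojure.string/split stands in for a real CSV parser, and the selected column is made up):

(require '[clojure.string :as str])

;; Hypothetical CSV parser/transducer pair for the same load-with!.
(def csv-parser (comp (partial map #(str/split % #",")) line-seq))
(def csv-xf (map #(nth % 2))) ; keep only the third column

(load-with! csv-parser [] csv-xf "file.csv")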

Thank you!

6 Comments
  • I don't fully understand your code. What is the role of parser? It seems to be passed but unused. Also, the expression (r) is probably not what you want; it calls the reader as a function. Commented Oct 22, 2016 at 21:23
  • I don't see why transducers would help. Transducers are useful when you have a series of operations you want to perform on the data; the transducer allows you to avoid creating intermediate data structures that will be thrown away. This code does only one thing--it maps get-in. Note into is non-lazy. Could you process the file lazily? Using for, map, or the sequence transducer function, could you create a lazy sequence of map entries? If they are properly handled, you could process each one without keeping all of the file contents in memory. Commented Oct 22, 2016 at 22:34
  • The goal of the parser/transducer is to easily adapt the work according to the file format (e.g. JSON, CSV ...) and the vendor format within the file. Commented Oct 23, 2016 at 19:25
  • Could you give some more specifics on the data in the JSON file, e.g., the number of lines and the size of each line? Or, even better, upload a representative version of the file somewhere so that we can reproduce the problem exactly. I tried your code on a very small file and that worked fine, but I was expecting it to break, since getting 25 GB of memory usage from a 3 GB file suggests some sort of infinite loop or something. Commented Dec 19, 2016 at 9:14
  • @Mars Yes, in this particular case the xform isn't doing much. However for a different file you might wish to apply some filtering as well as some get-in operation, in which case having the load-with! fn accept an xform is definitely useful. As for processing the file lazily, as far as I can tell that should be the case since line-seq is lazy and so is map, but the OOM error obviously suggests that something is going wrong somewhere. Of course into is non-lazy but load-with! must return something non-lazy, and I think the point is that the extracted data is expected to fit into memory. Commented Dec 19, 2016 at 9:58

1 Answer


You're pretty close. I don't know exactly what cheshire.core/parse-string does, but if it's the same as json/read-str from clojure.data.json, then this code should be what you are trying to do up there.

It looks like you were going for something like this:

(require '[clojure.data.json :as json])
(require '[clojure.java.io :as java])

(defn load-with!
  "Load a file using a parser, a structure and a transducer."
  [parser structure xform path]
  (with-open [r (java/reader path)]
    ;; Here xform is applied as an ordinary seq-to-seq function,
    ;; not passed to into as a transducer.
    (into structure (xform (parser r)))))

(def xf (partial map #(get-in % ["experiments" "results"])))

(def parser (comp (partial map json/read-str) line-seq))

(load-with! parser [] xf "file.json")
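For reference, the question instead passes xf to into's three-argument arity as a transducer. Both shapes are valid Clojure; the difference is only where the mapping is applied. A sketch using the requires above (xf-transducer is a name introduced here for illustration):

(def xf-transducer (map #(get-in % ["experiments" "results"])))

(with-open [r (java/reader "file.json")]
  (into [] xf-transducer (map json/read-str (line-seq r))))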

I'm guessing these were just mistakes made when cutting the business details out of your minimal example. Using the code below, I was able to process a large file for which the code above gave me an OOM error:

(require '[clojure.data.json :as json])
(require '[clojure.java.io :as java])

;; Accumulate the extracted results in a global atom, since doseq is
;; run only for its side effects and returns nil.
(def structure (atom []))

(defn do-it! [xform path]
  (with-open [r (java/reader path)]
    (doseq [line (line-seq r)]
      (swap! structure conj (xform line)))))

(defn xf [line]
  (-> (json/read-str line)
      (get-in ["experiments" "results"])))

(do-it! xf "file.json")

(take 10 @structure)
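If the global atom is a concern, the same accumulation can also be written as a reduce that returns the vector directly. A sketch under the same assumptions (load-all is a made-up name, and this was not tested against the 3 GB file):

(defn load-all [xform path]
  (with-open [r (java/reader path)]
    ;; reduce is eager, so the whole file is consumed before the
    ;; reader closes; only the accumulated vector is retained.
    (reduce (fn [acc line] (conj acc (xform line)))
            []
            (line-seq r))))

(take 10 (load-all xf "file.json"))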

4 Comments

Thank you for your proposal. Is it necessary to have a global variable? What is the difference compared to the solution with (into ...)?
The first bit of code will work if you have enough memory. I think the atom is necessary with doseq. I ran out of time to research this, so my answer was only a minor improvement.
Would be nice if anyone commented on why the initial code isn't really stream-processing in constant memory as expected (and why the suggested code is).
