$\begingroup$

I'm downloading this data file from Kaggle.

I want to get all the IDs.

Normally for a JSON file I'd write

Import["./arxiv-metadata-oai-snapshot.json", {"Data", "id"}]

But I get the error

Import::jsonexpendofinput: Unexpected character found while looking for the end of input.
Import::jsonhintposandchar: An error occurred near character '"', at line 2:3

The only way I'd know to fix this is to import the ENTIRE file. It's massive. I really don't want to do that.

$\endgroup$
  • $\begingroup$ https://jsonlint.com gives good suggestions for fixing JSON strings. Just paste your file in there & check. Maybe paste by sections, especially the section you're having trouble with (JSON is a hierarchical format and you can extract subsections at any level). $\endgroup$ Commented Apr 1, 2022 at 21:03
  • $\begingroup$ That file has the extension .json, but its contents don't appear to be valid JSON. Each line looks like valid JSON, but the file as a whole is not. This is a good thing, because a 3.5GB JSON string sounds like a horrible idea. $\endgroup$ Commented Apr 1, 2022 at 21:07
  • $\begingroup$ Any trick to import each line? $\endgroup$ Commented Apr 1, 2022 at 21:18

1 Answer

$\begingroup$

Any trick to import each line?

The key here is to open the file as a stream rather than trying to import it all at once. Once you have it open as a stream, use ReadLine to read a single line at a time. Then you can pass that line to ImportString and extract its "id" field.

Something like this would work

stream = OpenRead @ "arxiv-metadata-oai-snapshot.json";
(* use a dynamic array to capture the results incrementally *)
res = CreateDataStructure @ "DynamicArray";
Monitor[
    lineNumber = 0;
    While[(line = ReadLine @ stream) =!= EndOfFile,
        id = ImportString[line, "RawJSON"]["id"];
        res["Append", id];
        lineNumber++
    ],
    lineNumber
];
(* release the file handle now that every line has been read *)
Close @ stream;
(* now convert the dynamic array into a list *)
res = Normal @ res;

Note I'm using Monitor so you can keep track of how far you are into the 2 million lines in the file.

You may find the above a little slow, because it parses the entire JSON string on every line just to grab the id field. If you know that each line contains the field written exactly like "id":"the_id_i_want" (with no whitespace around the colon), then you can replace the

id = ImportString[....

line with

(* the pattern variable is named x so it does not shadow the id being assigned *)
id = First[
  StringCases[line, 
   "\"id\":\"" ~~ Shortest[x__] ~~ "\"" :> x], $Failed]

and it goes much faster.
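
Before committing to the string-pattern shortcut for the whole file, it may be worth checking on a single line that both approaches agree. The following is a minimal sketch; the sample line is made up for illustration (it is not taken from the actual file), and it assumes the "id" field is written with no whitespace around the colon.

```mathematica
(* hypothetical sample line, shaped like one record of the file *)
line = "{\"id\":\"0704.0001\",\"title\":\"An example\"}";

(* approach 1: parse the whole line as JSON, then look up the key *)
idParsed = ImportString[line, "RawJSON"]["id"];

(* approach 2: string-pattern shortcut; yields $Failed if nothing matches *)
idMatched = First[
   StringCases[line, "\"id\":\"" ~~ Shortest[x__] ~~ "\"" :> x],
   $Failed];

(* compare the two results *)
idParsed === idMatched
```

If the comparison is not True on real lines from your file (for example, because a line contains "id": with a space after the colon), the shortcut returns $Failed and the ImportString version is the safer choice.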

$\endgroup$
