
I am trying to write a .jsonl file that needs to look like this:

{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}

This is my attempt:

import json
import pandas as pd

dt = pd.read_csv('data.csv')
df = pd.DataFrame(dt)

file_name = df['image']
file_caption = df['text']

data = []

for i in range(len(file_name)):
    entry = {"file_name": file_name[i], "text": file_caption[i]}
    data.append(entry)

json_object = json.dumps(data, indent=4)

# Writing to metadata.jsonl
with open("metadata.jsonl", "w") as outfile:
    outfile.write(json_object)

But this is the output I get:

[
    {
        "file_name": "images/image_0.jpg",
        "text": "Fattoush Salad with Roasted Potatoes"
    },
    {
        "file_name": "images/image_1.jpg",
        "text": "an analysis of self portrayal in novels by virginia woolf A room of one's own study guide contains a biography of virginia woolf, literature essays, quiz questions, major themes, characters, and a full summary and analysis about a room of one's own a room of one's own summary."
    },
    {
        "file_name": "images/image_2.jpg",
        "text": "Christmas Comes Early to U.K. Weekly Home Entertainment Chart"
    },
    {
        "file_name": "images/image_3.jpg",
        "text": "Amy Garcia Wikipedia a legacy of reform: dorothea dix (1802\u20131887) | states of"
    },
    {
        "file_name": "images/image_4.jpg",
        "text": "3D Metal Cornish Harbour Painting"
    },
    {
        "file_name": "images/image_5.jpg",
        "text": "\"In this undated photo provided by the New York City Ballet, Robert Fairchild performs in \"\"In Creases\"\" by choreographer Justin Peck which is being performed by the New York City Ballet in New York. (AP Photo/New York City Ballet, Paul Kolnik)\""
    },
...
]

I know this happens because I am dumping the whole list at once, so I can see where I'm going wrong. How do I create a .jsonl file in the format shown at the top, with one JSON object per line?

1 Answer


Don't indent the generated JSON, and don't collect the entries in a list. Serialize each entry on its own and write it to the file as a single line:

import json
import pandas as pd

df = pd.DataFrame([['0001.png', "This is a golden retriever playing with a ball"],
                   ['0002.png', "A german shepherd"],
                   ['0003.png', "One chihuahua"]], columns=['filename','text'])

with open("metadata.jsonl", "w") as outfile:
    for file, caption in zip(df['filename'], df['text']):
        entry = {"file_name": file, "text": caption}
        print(json.dumps(entry), file=outfile)

Output:

{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}
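
The same loop can be wrapped in a small helper and checked by reading the lines back. This is only a sketch: `write_jsonl` is a hypothetical name, the sample pairs stand in for `zip(df['image'], df['text'])` from the question's DataFrame, and an in-memory buffer is used in place of the real file.

```python
import io
import json

def write_jsonl(pairs, outfile):
    """Write one JSON object per line for each (file_name, caption) pair."""
    for file_name, caption in pairs:
        entry = {"file_name": file_name, "text": caption}
        # json.dumps without indent emits a single line; quotes and line
        # breaks inside the caption are escaped, so each record stays on
        # exactly one line of the output.
        outfile.write(json.dumps(entry) + "\n")

# Stand-in for zip(df['image'], df['text']); the values are illustrative.
pairs = [
    ("images/image_0.jpg", "Fattoush Salad with Roasted Potatoes"),
    ("images/image_5.jpg", 'A caption with "quotes"\nand a line break'),
]

buffer = io.StringIO()  # for a real file: open("metadata.jsonl", "w")
write_jsonl(pairs, buffer)

# Round-trip check: every line must parse as a standalone JSON object.
lines = buffer.getvalue().splitlines()
parsed = [json.loads(line) for line in lines]
print(len(lines), "records written")
```

Because each line is parsed independently, a reader can process the file one record at a time without loading the whole thing.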

3 Comments

I have far more than 3 lines. It's a few thousand, so I can't just type them all out.
@oo92 Sure you can, if they are already in the dataframe. I typed them as an example since you didn't provide any sample input file.
@oo92 You can also use df.to_json('metadata.jsonl', orient='records', lines=True) to do it in one step.
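
A minimal sketch of that one-step approach, assuming the DataFrame has the question's `image` and `text` columns (the sample rows here are made up for illustration). The `image` column is renamed first so the keys match the target format. Calling `to_json` without a path returns the text as a string, which makes it easy to inspect; pass `'metadata.jsonl'` as the first argument to write the file instead.

```python
import json
import pandas as pd

# Illustrative rows; in practice this would come from pd.read_csv('data.csv').
df = pd.DataFrame({
    "image": ["images/image_0.jpg", "images/image_1.jpg"],
    "text": ["Fattoush Salad with Roasted Potatoes", "A german shepherd"],
})

# Rename 'image' to the key the target format expects and fix the column order.
records = df.rename(columns={"image": "file_name"})[["file_name", "text"]]

# orient='records' with lines=True emits one JSON object per row, per line.
jsonl_text = records.to_json(orient="records", lines=True)

# Parse each line back to confirm the structure.
rows = [json.loads(line) for line in jsonl_text.splitlines() if line]
print(rows[0])
```

Note that pandas may escape forward slashes in strings (`images\/image_0.jpg`). That is still valid JSON and parses back to the original path, but the raw text will differ slightly from what `json.dumps` produces.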
