I have a dataframe that contains a product name, question, and answers. I would like to process the dataframe and transform it into a JSON format. Each product should have nested sections for questions and answers.

My dataframe:

import polars as pl

df = pl.DataFrame({
    "Product": ["X", "X", "Y", "Y"],
    "Question": ["Q1", "Q2", "Q3", "Q4"],
    "Answers": ["A1", "A2", "A3", "A4"],
}) 

Desired Output:

{
    "faqByCommunity": {
        "id": 5,
        "communityName": "name",
        "faqList": [
            {
                "id": 1,
                "product": "X",
                "faqs": [
                    {
                        "id": 1,
                        "question": "Q1",
                        "answer": "A1"
                    },
                    {
                        "id": 2,
                        "question": "Q2",
                        "answer": "A2"
                    }

                ]
            },
            {
                "id": 2,
                "product": "Y",
                "faqs": [
                    {
                        "id": 1,
                        "question": "Q3",
                        "answer": "A3"
                    },
                    {
                        "id": 2,
                        "question": "Q4",
                        "answer": "A4"
                    }

                ]
            }
        ]
    }
}

Since the first part is static, I think I could write it to the file before and after Polars writes its output (like in my other question). However, I'm not sure how to build the nested part.

  • Is there a compelling reason to use polars here? Commented Mar 11 at 18:13
  • Because I got the table from an Excel file, so I wanted to use it as a dataframe. Commented Mar 11 at 18:19

2 Answers

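You could do some of the reshaping in Polars first. The code below assumes the columns have been renamed to product, question, and answer, matching the keys in the desired output.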

faq_list = (
    df.group_by("product", maintain_order=True)
      .agg(faqs=pl.struct(pl.int_range(pl.len()).alias("id") + 1, pl.exclude("product")))
      .with_row_index("id", offset=1)
      #.to_struct()
      #.to_list()
)
shape: (2, 3)
┌─────┬─────────┬────────────────────────────────┐
│ id  ┆ product ┆ faqs                           │
│ --- ┆ ---     ┆ ---                            │
│ u32 ┆ str     ┆ list[struct[3]]                │
╞═════╪═════════╪════════════════════════════════╡
│ 1   ┆ X       ┆ [{1,"Q1","A1"}, {2,"Q2","A2"}] │
│ 2   ┆ Y       ┆ [{1,"Q3","A3"}, {2,"Q4","A4"}] │
└─────┴─────────┴────────────────────────────────┘

With the .to_struct() and .to_list() lines uncommented, the result is a list of dicts:

[{'id': 1,
  'product': 'X',
  'faqs': [{'id': 1, 'question': 'Q1', 'answer': 'A1'},
   {'id': 2, 'question': 'Q2', 'answer': 'A2'}]},
 {'id': 2,
  'product': 'Y',
  'faqs': [{'id': 1, 'question': 'Q3', 'answer': 'A3'},
   {'id': 2, 'question': 'Q4', 'answer': 'A4'}]}]

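You could then add the static parts and pretty-print the result with json.dumps. This needs faq_list in its list form, i.e. with the .to_struct() and .to_list() lines uncommented.

import json

print(
    json.dumps({
        "faqByCommunity": {
            "id": 5,
            "communityName": "name",
            "faqList": faq_list
        }
    }, indent=4)
)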

You could also add the static parts with Polars if you really wanted to.

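print(
    json.dumps(
        (df.group_by("product", maintain_order=True)
           .agg(
                # one struct per row: a 1-based id plus the question and answer
                faqs = pl.struct(
                    pl.int_range(pl.len()).alias("id") + 1,
                    pl.exclude("product")
                )
           )
           .with_row_index("id", offset=1)  # 1-based id per product
           .select(
               pl.struct(
                   faqByCommunity = pl.struct(
                       id = 5,
                       # pl.lit is needed: a bare string would be read as a column name
                       communityName = pl.lit("name"),
                       # collect every product row into a single list
                       faqList = pl.struct(pl.all()).implode()
                   )
               )
           )
           .item()  # the 1x1 frame holds one struct, returned as a dict
        ),
        indent = 4
    )
)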

Without knowing more about how much data you have, I would probably just use iter_rows() over the dataframe and build the resulting dictionary by hand rather than attempt something more nuanced in Polars. I am not a Polars expert, but from what I can see, Polars does not offer much flexibility in how write_json() shapes its output.

Something like:

import polars as pl

df = pl.DataFrame({
    "Product": ["X", "X", "Y", "Y"],
    "Question": ["Q1", "Q2", "Q3", "Q4"],
    "Answers": ["A1", "A2", "A3", "A4"],
})

## ---------------
## Cluster rows by product 
## ---------------
product_data = {}
for row in df.iter_rows():
    product_data.setdefault(row[0], []).append(row[1:])
## ---------------

## ---------------
## Build the results dictionary using the clustered data
## and nested list comprehensions
## ---------------
results = {
    "faqByCommunity": {
        "id": 5,
        "communityName": "name",
        "faqList": [
            {
                "id": product_index,
                "product": product,
                "faqs": [
                    {
                        "id": qna_index,
                        "question": question,
                        "answer": answer
                    }
                    for qna_index, (question, answer) in enumerate(qnas, start=1)
                ]
            }
            for product_index, (product, qnas) in enumerate(product_data.items(), start=1)
        ]
    }
}
## ---------------

## ---------------
## Display the results
## ---------------
import json
print(json.dumps(results, indent=4))
## ---------------

This should give you the result you stated.
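
Since the question talks about writing to a file rather than printing, here is a minimal sketch of that last step. The helper name write_faq_json and the output location are made up for illustration, and the results dict is a stand-in with an empty faqList.

```python
import json
import tempfile
from pathlib import Path


def write_faq_json(results: dict, path: Path) -> None:
    """Write the nested dictionary to `path` as indented JSON (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(results, f, indent=4, ensure_ascii=False)


# Stand-in for the full dictionary built from the dataframe.
results = {"faqByCommunity": {"id": 5, "communityName": "name", "faqList": []}}

# Assumed output location; any writable path would do.
out_path = Path(tempfile.gettempdir()) / "faq.json"
write_faq_json(results, out_path)
```

Building the whole dictionary first and dumping it once avoids having to stitch the static header and footer around partial output by hand.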

2 Comments

.rows_by_key("product", named=True) does the defaultdict stuff for you which may help a little.
@jqurious Can you show me an example? I just tried with product_data = (mdp_2.rows_by_key("Producto", named=True))
