
I have this kind of data:

[{"id": 1, "name": "Alex", "projects": ["A", "B", "C"]}, {"id": 2, "name": "Bob", "projects": None}]

And I need an .xlsx file in this format:

Each element of the list should go in a separate row, but without the other columns getting duplicated (merged cells instead).


But I need to achieve this in a dynamic way: I won't have the cell indices statically, and finding those cell indices would be a bit too complex.

I use pandas with xlsxwriter as the engine to generate the .xlsx file. I can use other modules if needed.

  • Hello, did you already try something yourself? Commented Apr 30 at 9:58
  • Check the explode function and the rest of the Reshaping and pivot tables page. Once you load the data into a dataframe you probably only need df.explode('projects') Commented Apr 30 at 10:08

3 Answers


You can "explode" the list values to rows using the explode function. You'll find this and other ways to reshape dataframes on the Reshaping and pivot tables page.

By default, to_excel generates merged cells for multi-index rows, so you need to use a MultiIndex with this dataframe.

The resulting code is rather short:

import pandas as pd

data = [{"id": 1, "name": "Alex", "projects": ["A", "B", "C"]},
        {"id": 2, "name": "Bob", "projects": None}]

df = pd.DataFrame(data)
df = df.explode('projects')  # one row per project; None stays as a single row
df = df.set_index(['id', 'name', 'projects'])
df.to_excel(r'c:\spikes\exploded.xlsx')

Which generates

Generated Excel output (screenshot): https://i.sstatic.net/DpniEW4E.png
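For reference, since the screenshot may not render, this sketch prints the exploded frame before it is indexed and written. Note that explode leaves the None entry as a single row:

```python
import pandas as pd

data = [{"id": 1, "name": "Alex", "projects": ["A", "B", "C"]},
        {"id": 2, "name": "Bob", "projects": None}]

# explode turns each list element into its own row; scalars such as None pass through unchanged
df = pd.DataFrame(data).explode('projects')
print(df)
# four rows in total: A, B, C for Alex and a single None row for Bob
```

Once `id`, `name` and `projects` are set as the index, to_excel merges the repeated index cells automatically.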

Loading from a database

A comment suggests the data is loaded from a database. In that case it's probably easier to use read_sql to load the flat data and avoid explode. The index columns can be specified in read_sql directly. The code could look like this:

df = pd.read_sql(sql, conn, index_col=['id', 'name', 'projects'])
df.to_excel(r'c:\spikes\exploded.xlsx')

The projects can be grouped afterwards if needed, e.g.

df_grouped = df.groupby(['id', 'name']).agg({'projects': list})
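A minimal sketch of that grouping, with hypothetical flat rows standing in for the read_sql result:

```python
import pandas as pd

# flat rows, as they might come back from read_sql (hypothetical sample)
flat = pd.DataFrame({'id': [1, 1, 1, 2],
                     'name': ['Alex', 'Alex', 'Alex', 'Bob'],
                     'projects': ['A', 'B', 'C', None]})

# collapse the project rows back into one list per (id, name) pair
df_grouped = flat.groupby(['id', 'name']).agg({'projects': list})
print(df_grouped)
# id 1 gets ['A', 'B', 'C']; id 2 gets [None]
```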

6 Comments

Can you explain why there is only one id but multiple Alex's? And how can I make Alex appear only once as well?
Oops, I didn't notice. projects must be added to the index as well; leaf nodes aren't merged.
It gives an error if projects is added to the index, with the following message: TypeError: unhashable type: 'list'
Some parentheses were missing in the first snippet. I fixed this too. BTW if you load the data from a database it's easier to use read_sql to read the flat data and save to Excel. If you need to group the projects for something else you can use df.groupby(['id','name']).agg({'projects':list})
it works if df = df.explode('projects') is used instead of df.explode('projects'). And the pipeline method works fine. Thank you very much @panagiotis

You can iterate over the rows of the pandas DataFrame and append each row.

import pandas as pd
from openpyxl import Workbook

# Sample data
data = [
    {"id": 1, "name": "Alex", "projects": ["A", "B", "C"]},
    {"id": 2, "name": "Bob", "projects": None},
    {"id": 3, "name": "Robby", "projects": ["Z"]}
]

# Prepare a list to hold the rows for the DataFrame
rows = []

# Process the data
for entry in data:
    id_value = entry["id"]
    name_value = entry["name"]
    projects = entry["projects"]
    
    if projects is None:
        rows.append({"id": id_value, "name": name_value, "project": None})
    else:
        for project in projects:
            rows.append({"id": id_value, "name": name_value, "project": project})

# Create a DataFrame
df = pd.DataFrame(rows)

# Create a new Excel workbook and select the active worksheet
wb = Workbook()
ws = wb.active

# Write the header
header = df.columns.tolist()
ws.append(header)

# Write the data
for index, row in df.iterrows():
    ws.append(row.tolist())

# Merge cells for 'id' and 'name' where applicable
row = 0
while row < len(df):
    start = row
    # advance to the end of the run of rows sharing this id
    while row < len(df) and df.iloc[row]['id'] == df.iloc[start]['id']:
        row += 1
    if row - start > 1:
        # +2 because openpyxl is 1-indexed and row 1 holds the header
        ws.merge_cells(start_row=start + 2, start_column=1, end_row=row + 1, end_column=1)  # Merge 'id'
        ws.merge_cells(start_row=start + 2, start_column=2, end_row=row + 1, end_column=2)  # Merge 'name'

# Save the workbook
wb.save("output.xlsx")

print("Done!")

3 Comments

There's no need to do that. pandas already has methods to reshape dataframes, handle lists and generate merged cells.
I tried this; there are some other complications as well, but the main reason I cannot use this is that every time I fetch data from the database, convert it into a Python list and create the dataframe, the order of the columns changes. It got more complicated afterwards, so I came here to find new solutions. As a last resort I will return to this approach.
@AykhanAghayev why don't you use read_sql to load the dataframe directly? You won't have to use explode then. You can use other operations to reshape the dataframe and group the projects afterwards. You can use pivot_table or group with list as the aggregate function, as this question shows.

An answer similar to Mario's, but with a single pass through the data (and it keeps the "project" column as the last one):

data = [{"id": 1, "name": "Alex", "projects": ["A", "B", "C"]},
        {"id": 2, "name": "Bob", "projects": ["D", "E"]},
        {"id": 3, "name": "Charlie", "projects": None},
        {"id": 4, "name": "Devin", "projects": []},
        {"id": 5, "name": "Ellen", "projects": ["F", "G"]},]

# create a new excel file
import openpyxl
wb = openpyxl.Workbook()

# fill the first row with the keys of the first dictionary
for col, key in enumerate(data[0].keys()):
    wb.active.cell(row=1, column=col + 1).value = key

current_row = 2
for item in data:

    item_without_projects = item.copy()
    item_without_projects.pop("projects")

    # handle the case where projects is None or an empty list
    if item["projects"] is None or len(item["projects"]) == 0:
        for col, key in enumerate(item_without_projects.keys()):
            wb.active.cell(row=current_row, column=col + 1).value = str(item[key])
            
        current_row += 1
        continue

    # fill a row with data for each project
    for project in item["projects"]:
        for col, key in enumerate(item_without_projects.keys()):
            wb.active.cell(row=current_row, column=col + 1).value = str(item[key])
        # the project value goes in the column right after the other fields
        wb.active.cell(row=current_row, column=len(item_without_projects) + 1).value = project
        current_row += 1

    # merge the newly created cells
    for col, key in enumerate(item_without_projects.keys()):
        wb.active.merge_cells(start_row=current_row - len(item["projects"]), start_column=col + 1, end_row=current_row - 1, end_column=col + 1)
        # style: set text alignment to center
        cell = wb.active.cell(row=current_row - len(item["projects"]), column=col + 1)
        cell.alignment = openpyxl.styles.Alignment(horizontal='center', vertical='center')

# write the workbook
wb.save("data.xlsx")

This clearly doesn't use pandas, but from the other comments you posted I understand .explode won't solve your problem. I'd be happy to help further if you can describe the data-loading issues in more detail.

2 Comments

explode works. And even if it didn't, there's no reason to process rows one by one instead of using DataFrame operations.
I'm not saying it doesn't work; I'm saying he pointed out the problem was something else, in the data loading step.
