
I have a docx document that was converted from PDF with the pdf2docx library. The result looks good, but when I load the docx document with python-docx, I get a table in which some cells contain text instead of being empty. Those cells are filled with the text of the cell one row above them.

The table looks like this:

(screenshot of the table)

The table contains three rows. The first row should contain cells with the values [Barriere, Bonuslevel, Cap, Beobachtungszeitraum, Anfangl], and the second and third rows should be empty except for the last column. But in the debugger I can see that the supposedly empty cells contain text values like this: (screenshot of the debugger output)

The text Basiswert is in the first cell and also in the sixth cell. The sixth cell should be empty. I opened the XML file of the docx document and everything looks fine there, so I think the problem is in the python-docx library. Has anyone ever had the same problem?

Edit: This article turned out to be very helpful:

https://python-docx.readthedocs.io/en/latest/dev/analysis/features/table/cell-merge.html

Basically, the copied cells are continuation cells, which indicates that cells are merged into horizontal or vertical spans. But I still don't know how to read this information through the python-docx API.

  • I don't think python-docx can just imagine data where none exists. Are you sure this isn't about e.g. merged columns or such? Commented Oct 4, 2021 at 13:38
  • What do you mean by 'merged columns'? How should I recognize it? Commented Oct 4, 2021 at 13:53
  • I don't know what the lines in your debug panel are, but the hex address of your 6th cell is exactly the same as that of the 1st. Since the table at the beginning of your post has 5 columns, couldn't that be the cause of your problem? Commented Oct 4, 2021 at 15:11
  • Yes, I have noticed that too, but I still have no clue what is going on there. I wanted to ignore the cells with the same id, but when I start to iterate over the rows, the references to the cells also change and are no longer the same. Commented Oct 4, 2021 at 16:25

1 Answer

The addressing of table cells in python-docx is based on the grid layout. Basically the grid is all the cells before any cell merging is done. In the grid layout there are n rows and m columns and m * n cells; each row-column combination/intersection has a cell.

When you address a grid cell that is "merged" into some other cell, then the top-left member of the merged (rectangular) region is returned.

This means that some content is returned more than once if the table includes merged cells.
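As a rough illustration of the point above, the following sketch skips the repeated content by comparing the underlying XML element of each cell rather than the cell object itself. It relies on `_tc`, a private attribute of python-docx cells, so treat it as an assumption that may change between versions. The names `unique_cells` and `demo` are hypothetical helpers, not part of python-docx, and `demo` needs python-docx installed.

```python
def unique_cells(cells):
    """Yield each distinct cell once, skipping repeats caused by merged cells.

    Cell objects may be recreated on each access, so they are compared by
    their underlying XML element (`_tc`, a private attribute) instead.
    """
    seen = []  # holding references keeps the identity comparison valid
    for cell in cells:
        if any(cell._tc is tc for tc in seen):
            continue  # same merged region was already returned
        seen.append(cell._tc)
        yield cell


def demo():
    """Build a small table with a vertical merge and print both views."""
    from docx import Document  # requires the python-docx package

    document = Document()
    table = document.add_table(rows=3, cols=2)
    table.cell(0, 0).text = "Basiswert"
    # Merge the first column of rows 0 and 1 into one vertical span.
    table.cell(0, 0).merge(table.cell(1, 0))

    all_cells = [cell for row in table.rows for cell in row.cells]
    print([cell.text for cell in all_cells])                # grid view, with repeats
    print([cell.text for cell in unique_cells(all_cells)])  # one entry per region
```

Calling `demo()` prints the grid view followed by the de-duplicated view. Comparing `_tc` elements rather than cell objects is a design choice made here because the cell wrappers are not guaranteed to be the same object across separate accesses.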


3 Comments

OK, so the question is: how do I recognize cells that are merged?
Have a search on "python-docx merged cells"; this is one discussion with some ideas you might find useful.
Yep, I have found something, e.g. github.com/python-openxml/python-docx/issues/232. The _tc attribute of a cell has information about the span.
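
A minimal sketch of reading that span information, assuming the private `grid_span` and `vMerge` properties of the `w:tc` element objects and the `tr_lst` / `tc_lst` child lists that python-docx exposes internally. None of these are part of the documented public API, and `span_info` and `iter_raw_cells` are hypothetical helper names.

```python
def span_info(tc):
    """Classify a raw w:tc element by its merge attributes.

    Assumes `tc.grid_span` is an int (1 when there is no horizontal merge)
    and `tc.vMerge` is None, "restart", or "continue".
    """
    grid_span = tc.grid_span
    v_merge = tc.vMerge
    if v_merge == "continue":
        kind = "vertical continuation"
    elif v_merge == "restart":
        kind = "vertical merge origin"
    elif grid_span > 1:
        kind = "horizontal merge origin"
    else:
        kind = "plain"
    return kind, grid_span


def iter_raw_cells(table):
    """Yield (row_index, tc, span_info) for every w:tc element in the table.

    Walks the XML directly, so continuation cells show up as themselves
    instead of being replaced by the top-left cell of their merged region.
    """
    for row_index, tr in enumerate(table._tbl.tr_lst):
        for tc in tr.tc_lst:
            yield row_index, tc, span_info(tc)
```

With a table loaded through python-docx, `for row_index, tc, info in iter_raw_cells(table): print(row_index, info)` would list each physical cell together with its classification, which is one way to tell continuation cells from genuinely filled ones.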
