I have a list, row_list, that looks like this:
[
[{'bottom': Decimal('58.650'),
'text': 'Hi there!',
'top': Decimal('40.359'),
'x0': Decimal('21.600'),
'x1': Decimal('65.644')}
],
[{'bottom': Decimal('74.101'),
'text': 'Your email',
'top': Decimal('37.519'),
'x0': Decimal('223.560'),
'x1': Decimal('300')},
{'bottom': Decimal('77.280'),
'text': '[email protected]',
'top': Decimal('62.506'),
'x0': Decimal('21.600'),
'x1': Decimal('140.775')}]
]
As you can see, it is a list of lists, where each inner list holds one or more dicts. Indexed by row, the text it contains can be represented as:
[0] = 'Hi there!'
[1] = 'Your email'
[1] = '[email protected]'
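To make that listing concrete, here is a minimal sketch of how it could be printed from the nested list. The name `rows` is a hypothetical stand-in that only carries the 'text' key; the real data has the extra coordinate keys shown above.

```python
# Hypothetical, trimmed-down stand-in for the nested list above.
rows = [
    [{'text': 'Hi there!'}],
    [{'text': 'Your email'}, {'text': '[email protected]'}],
]

# One line per word, prefixed with the index of the row it belongs to.
lines = [
    f"[{row_index}] = {word['text']!r}"
    for row_index, row in enumerate(rows)
    for word in row
]
print("\n".join(lines))
```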
This is the code that generates row_list:
word_list = sorted(first_page.extract_words(),
                   key=lambda x: x['bottom'])
threshold = float('10')
current_row = [word_list[0], ]
row_list = [current_row, ]
for word in word_list[1:]:
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        # distance is small, use same row
        current_row.append(word)
    else:
        # distance is big, create new row
        current_row = [word, ]
        row_list.append(current_row)
What I am trying to do is map the output above to something like:
new = {
1: {
1: {'text': 'Hi there!', 'x0': Decimal('21.600')}
},
2: {
1: {'text':'Your email', 'x0': Decimal('223.560')},
2: {'text': '[email protected]', 'x0': Decimal('21.600')}
}
}
I have tried all sorts of things and just can't figure it out, as my original data is a list (of lists) and I am trying to represent it as a dict.
row_list is sorted, but the final result doesn't need to be.
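For reference, here is a minimal sketch of the kind of conversion I am after, assuming 1-based numbering for both rows and words and that only 'text' and 'x0' should be kept. The helper name `rows_to_dict` and its `keys` parameter are made up for illustration, and the sample data is copied from the list above rather than read from a PDF.

```python
from decimal import Decimal

# Sample data in the same shape as row_list (values copied from the question).
row_list = [
    [{'bottom': Decimal('58.650'), 'text': 'Hi there!',
      'top': Decimal('40.359'), 'x0': Decimal('21.600'),
      'x1': Decimal('65.644')}],
    [{'bottom': Decimal('74.101'), 'text': 'Your email',
      'top': Decimal('37.519'), 'x0': Decimal('223.560'),
      'x1': Decimal('300')},
     {'bottom': Decimal('77.280'), 'text': '[email protected]',
      'top': Decimal('62.506'), 'x0': Decimal('21.600'),
      'x1': Decimal('140.775')}],
]


def rows_to_dict(rows, keys=('text', 'x0')):
    """Hypothetical helper: number rows and words from 1, keep only `keys`."""
    return {
        row_no: {
            word_no: {key: word[key] for key in keys}
            for word_no, word in enumerate(row, start=1)
        }
        for row_no, row in enumerate(rows, start=1)
    }


new = rows_to_dict(row_list)
print(new[2][1]['text'])
```

The nested dict comprehension relies on `enumerate(..., start=1)` to produce the 1-based keys, so the list order of row_list decides the numbering.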