I have the following list of lists:
data = [
    [[0.025],
     ['-DOCSTART-'],
     ['O']],
    [[0.166, 0.001, 4.354, 4.366, 7.668],
     ['Summary', 'of', 'Consolidated', 'Financial', 'Data'],
     ['O', 'O', 'B-ORG', 'I-ORG', 'E-ORG']],
    [[0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05],
     ['Port', 'conditions', 'from', 'Lloyds', 'Shipping', 'Intelligence', 'Service', '--'],
     ['S-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O']]
]
Note: The three lists within each data[i] (losses, tokens, and tags) have equal length, for i in [0, 1, 2].
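The equal-length assumption can be verified before any further processing. The following is a minimal sketch; `check_aligned` and `sample` are hypothetical names introduced only for illustration, and `sample` is a shortened stand-in for the data above.

```python
def check_aligned(data):
    """Raise ValueError if the lists inside any data[i] differ in length."""
    for i, (losses, tokens, tags) in enumerate(data):
        if not (len(losses) == len(tokens) == len(tags)):
            raise ValueError(
                f"data[{i}]: lengths differ: "
                f"{len(losses)}, {len(tokens)}, {len(tags)}"
            )


# Shortened sample in the same shape as data (hypothetical, for illustration).
sample = [
    [[0.025], ['-DOCSTART-'], ['O']],
    [[0.166, 0.001], ['Summary', 'of'], ['O', 'O']],
]
check_aligned(sample)  # returns None when every entry is aligned
```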
I want to create a JSON file as follows:
[{
    "sentence": "-DOCSTART- Summary of Consolidated Financial Data Port conditions from Lloyds Shipping Intelligence Service --",
    "annotations": [
        {
            "decision": "Consolidated Financial Data",
            "category": "ORG",
            "token_loss": [4.354, 4.366, 7.668],
            "totalloss": 4.354+4.366+7.668 # Here, I consider the sum of "token_loss"
        },
        {
            "decision": "Port",
            "category": "PER",
            "token_loss": [0.195],
            "totalloss": 0.195
        },
        {
            "decision": "Lloyds Shipping Intelligence Service",
            "category": "ORG",
            "token_loss": [3.561, 3.793, 6.741, 4.0],
            "totalloss": 3.561+3.793+6.741+4.0
        }]
}]
In the tag lists, a multi-word entity is always tagged as a sequence of "B-" (Begin), "I-" (Inside), and "E-" (End), and a single-word entity is always tagged "S-" (Single). I ignore words tagged "O" (Outside).
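To make the tagging scheme concrete, here is a small sketch of how one tag could be split into its position prefix and its category. `split_tag` is a hypothetical helper, not part of any library, and it assumes the prefix and category are always separated by the first hyphen.

```python
def split_tag(tag):
    """Split a tag such as 'B-ORG' into (prefix, category).

    A bare 'O' tag has no category, so None is returned in its place.
    """
    prefix, sep, category = tag.partition('-')
    return prefix, (category if sep else None)


for tag in ['S-PER', 'O', 'B-ORG', 'I-ORG', 'E-ORG']:
    print(tag, split_tag(tag))
```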
This is what I have tried so far to solve this problem:
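# Work on one sentence at a time, e.g. the second entry of data.
_, tokens, tags = data[1]
decisions = []
startIdx = None  # index where the current multi-word entity begins
# enumerate() gives each tag's own position; tags.index(tag) would
# always return the first occurrence of a repeated tag.
for idx, tag in enumerate(tags):
    if tag.startswith('S-'):
        decisions.append(tokens[idx:idx + 1])
    elif tag.startswith('B-'):
        startIdx = idx
    elif tag.startswith('E-') and startIdx is not None:
        decisions.append(tokens[startIdx:idx + 1])
        startIdx = None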
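One possible way to go from the nested lists to the target structure is sketched below. It rests on several assumptions: `build_record` and `sample` are hypothetical names, all entries of the input are joined into a single sentence as in the desired output, "token_loss" is always emitted as a list (also for "S-" entities), and "totalloss" is rounded to three decimals to hide floating-point noise. An "E-" tag without a preceding "B-" is silently skipped.

```python
import json


def build_record(data):
    """Join all tokens into one sentence and collect one annotation per entity."""
    sentence_tokens = []
    annotations = []
    for losses, tokens, tags in data:
        sentence_tokens.extend(tokens)
        start = None  # index where the currently open entity begins
        for idx, tag in enumerate(tags):
            prefix, _, category = tag.partition('-')
            if prefix in ('B', 'S'):
                start = idx  # an entity opens here
            if prefix in ('S', 'E') and start is not None:
                span_losses = losses[start:idx + 1]
                annotations.append({
                    "decision": " ".join(tokens[start:idx + 1]),
                    "category": category,
                    "token_loss": span_losses,
                    # Rounding is an assumption, not a stated requirement.
                    "totalloss": round(sum(span_losses), 3),
                })
                start = None  # the entity is closed
    return [{"sentence": " ".join(sentence_tokens), "annotations": annotations}]


# One entry in the same shape as data (copied from its second element).
sample = [
    [[0.166, 0.001, 4.354, 4.366, 7.668],
     ['Summary', 'of', 'Consolidated', 'Financial', 'Data'],
     ['O', 'O', 'B-ORG', 'I-ORG', 'E-ORG']],
]
print(json.dumps(build_record(sample), indent=2))
```

To write a file instead of printing, the same list can be passed to `json.dump` together with an open file handle.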