I have the following list of lists:

data= [
       [[0.025], 
        ['-DOCSTART-'], 
        ['O']],

       [[0.166, 0.001, 4.354, 4.366, 7.668], 
        ['Summary', 'of', 'Consolidated', 'Financial', 'Data'], 
        ['O', 'O', 'B-ORG', 'I-ORG', 'E-ORG']],

       [[0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05], 
        ['Port', 'conditions', 'from', 'Lloyds', 'Shipping', 'Intelligence', 'Service', '--'], 
        ['S-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O']]
      ]

Note: Every list within data[i] has equal length, i in [0, 1, 2].

I want to create a JSON file as follows:

[{
  "sentence": "-DOCSTART- Summary of Consolidated Financial Data Port conditions from Lloyds Shipping Intelligence Service --",
  "annotations": [
    {
      "decision": "Consolidated Financial Data",
      "category": "ORG",
      "token_loss": [4.354, 4.366, 7.668],
      "totalloss": 4.354+4.366+7.668 # Here, I consider the sum of "token_loss"
    },
    {
      "decision": "Port",
      "category": "PER",
      "token_loss": [0.195],
      "totalloss": 0.195
    },
    {
      "decision": "Lloyds Shipping Intelligence Service",
      "category": "ORG",
      "token_loss": [3.561, 3.793, 6.741, 4.0],
      "totalloss": 3.561+3.793+6.741+4.0
    }]
}]

In the tag lists, a multi-word entity is always a sequence of "B-" (Begin), zero or more "I-" (Inside), and "E-" (End). A single-word entity is always tagged "S-" (Single). Words tagged "O" (Outside) should be ignored.


This is what I have tried so far. It only collects the tokens of each B-...E- group for one entry, not the losses or the S- words:

# Work on a single entry of data, e.g. the second one
losses, tokens, tags = data[1]
decisions = []
start = None
for idx, tag in enumerate(tags):
    if tag.startswith('B'):
        start = idx                                # entity begins here
    elif tag.startswith('E') and start is not None:
        decisions.append(tokens[start:idx + 1])    # tokens from B to E inclusive
        start = None
  • I tried but didn't find a complete answer for this. I didn't post my code because it isn't well structured yet. Commented Jul 1, 2021 at 0:34

1 Answer

You can use a generator function to produce the groupings (the := assignment expression used below requires Python 3.8 or later):

import collections
import json

data = [
    [[0.025], ['-DOCSTART-'], ['O']],
    [[0.166, 0.001, 4.354, 4.366, 7.668],
     ['Summary', 'of', 'Consolidated', 'Financial', 'Data'],
     ['O', 'O', 'B-ORG', 'I-ORG', 'E-ORG']],
    [[0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05],
     ['Port', 'conditions', 'from', 'Lloyds', 'Shipping', 'Intelligence', 'Service', '--'],
     ['S-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O']],
]

def p_ranges(s):
    # Yield (token_indices, category) for each S- tag and each B-...E- run.
    r = None
    for i, a in enumerate(s):
        if a != 'O':
            if a.startswith('S'):
                yield ([i], a.split('-')[-1])
            elif a.startswith('B'):
                r = [i]
            elif a.startswith('E'):
                yield (r + [i], a.split('-')[-1])
                r = None
            elif r:
                r.append(i)

def get_pairings(d):
    # For each (losses, tokens, tags) triple, yield the sentence string,
    # then one annotation dict per entity found in the tags.
    for a, b, c in d:
        yield ' '.join(b)
        for i, _c in p_ranges(c):
            yield {"decision": ' '.join(b[j] for j in i),
                   "category": _c,
                   "token_loss": (t := [a[j] for j in i]),
                   "totalloss": sum(t)}

# Group the yielded items by type: str -> sentence parts, dict -> annotations.
d = collections.defaultdict(list)
for i in get_pairings(data):
    d[type(i)].append(i)

result = [{'sentence': ' '.join(d[str]), 'annotations': d[dict]}]
print(json.dumps(result, indent=4))

Output:

[
    {
        "sentence": "-DOCSTART- Summary of Consolidated Financial Data Port conditions from Lloyds Shipping Intelligence Service --",
        "annotations": [
            {
                "decision": "Consolidated Financial Data",
                "category": "ORG",
                "token_loss": [
                    4.354,
                    4.366,
                    7.668
                ],
                "totalloss": 16.387999999999998
            },
            {
                "decision": "Port",
                "category": "PER",
                "token_loss": [
                    0.195
                ],
                "totalloss": 0.195
            },
            {
                "decision": "Lloyds Shipping Intelligence Service",
                "category": "ORG",
                "token_loss": [
                    3.561,
                    3.793,
                    6.741,
                    4.0
                ],
                "totalloss": 18.095
            }
        ]
    }
]
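
The totals above show binary floating-point noise (16.387999999999998 rather than 16.388). If that matters for the file, one option is to round the sum. This is a minimal sketch, assuming three decimal places are enough because the sample losses have three; the helper name rounded_total and its ndigits default are illustrative and not part of the code above:

```python
def rounded_total(losses, ndigits=3):
    # Sum the per-token losses, then round away float noise.
    # ndigits=3 is an assumption based on the precision of the sample data.
    return round(sum(losses), ndigits)

print(rounded_total([4.354, 4.366, 7.668]))
print(rounded_total([0.033, 0.539]))
```

In get_pairings, this would mean using rounded_total(t) in place of sum(t).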

When running on your new sample:

data = [
    [[0.036, 0.937, 0.032, 2.985, 0.0, 0.044, 0.033, 0.539, 0.01, 0.009, 0.628, 0.706],
     ['At', 'Colchester', ':', 'Gloucestershire', '280', '(', 'J.', 'Russell', '63', ',', 'A.', 'Symonds'],
     ['O', 'S-LOC', 'O', 'S-ORG', 'O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'B-PER', 'E-PER']],
]
d = collections.defaultdict(list)
for i in get_pairings(data):
    d[type(i)].append(i)

result = [{'sentence': ' '.join(d[str]), 'annotations': d[dict]}]
print(json.dumps(result, indent=4))

Output:

[
    {
        "sentence": "At Colchester : Gloucestershire 280 ( J. Russell 63 , A. Symonds",
        "annotations": [
            {
                "decision": "Colchester",
                "category": "LOC",
                "token_loss": [
                    0.937
                ],
                "totalloss": 0.937
            },
            {
                "decision": "Gloucestershire",
                "category": "ORG",
                "token_loss": [
                    2.985
                ],
                "totalloss": 2.985
            },
            {
                "decision": "J. Russell",
                "category": "PER",
                "token_loss": [
                    0.033,
                    0.539
                ],
                "totalloss": 0.5720000000000001
            },
            {
                "decision": "A. Symonds",
                "category": "PER",
                "token_loss": [
                    0.628,
                    0.706
                ],
                "totalloss": 1.334
            }
        ]
    }
]
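
Since the goal is a JSON file rather than printed text, the result can be written with json.dump. This is a minimal sketch, assuming a hypothetical output name annotations.json and a result list shaped like the one built above (a one-entity stand-in is used here so the snippet runs on its own):

```python
import json

# Stand-in for the `result` list built above; same shape, fewer entries.
result = [{
    "sentence": "At Colchester :",
    "annotations": [
        {"decision": "Colchester", "category": "LOC",
         "token_loss": [0.937], "totalloss": 0.937},
    ],
}]

# "annotations.json" is a hypothetical filename; any writable path works.
with open("annotations.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)

# Read the file back to confirm that it round-trips.
with open("annotations.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == result)
```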
4 Comments

Thanks a lot, @Ajax1234. Your code raises an exception for a case like: [[0.036, 0.937, 0.032, 2.985, 0.0, 0.044, 0.033, 0.539, 0.01, 0.009, 0.628, 0.706], ['At', 'Colchester', ':', 'Gloucestershire', '280', '(', 'J.', 'Russell', '63', ',', 'A.', 'Symonds'], ['O', 'S-LOC', 'O', 'S-ORG', 'O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'B-PER', 'E-PER']]
@Joe In your original data, shouldn't [18.44, [0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05]] really be [18.44, 0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05]? In your most recent data, you don't have that nesting format for your float values even with S- type values present.
Thank you. Your code still raises IndexError: list index out of range at "token_loss":(t:=[a[j] for j in i]) for cases like the one in my comment above.
@Joe Make sure your data is triple nested: [[[0.036, 0.937, 0.032, 2.985, 0.0, 0.044, 0.033, 0.539, 0.01, 0.009, 0.628, 0.706], ['At', 'Colchester', ':', 'Gloucestershire', '280', '(', 'J.', 'Russell', '63', ',', 'A.', 'Symonds'], ['O', 'S-LOC', 'O', 'S-ORG', 'O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'B-PER', 'E-PER']]]. I added the output I get when running your latest data on my solution in my most recent edit.
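
The shape requirement discussed in these comments can be checked before calling get_pairings. This is a minimal sketch, assuming each entry must be a [losses, tokens, tags] triple of equal-length lists, as the question states; check_entries is a hypothetical helper and not part of the answer's code:

```python
def check_entries(data):
    # Raise ValueError naming the offending index if an entry is not a
    # triple of equal-length lists; otherwise return the data unchanged.
    for n, entry in enumerate(data):
        if not (isinstance(entry, list) and len(entry) == 3
                and all(isinstance(part, list) for part in entry)):
            raise ValueError(f"entry {n} is not a [losses, tokens, tags] triple")
        losses, tokens, tags = entry
        if not (len(losses) == len(tokens) == len(tags)):
            raise ValueError(f"entry {n} has lists of unequal length")
    return data

good = [[[0.036, 0.937], ['At', 'Colchester'], ['O', 'S-LOC']]]
check_entries(good)            # passes silently

try:
    check_entries(good[0])     # one level of nesting is missing
except ValueError as err:
    print(err)
```

A failed check points at the entry to fix, which may be easier to act on than an exception raised inside the generator.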
