Create JSON file from list of lists matching

Question

I have the following list of lists:

data= [
       [[0.025], 
        ['-DOCSTART-'], 
        ['O']],

       [[0.166, 0.001, 4.354, 4.366, 7.668], 
        ['Summary', 'of', 'Consolidated', 'Financial', 'Data'], 
        ['O', 'O', 'B-ORG', 'I-ORG', 'E-ORG']],

       [[0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05], 
        ['Port', 'conditions', 'from', 'Lloyds', 'Shipping', 'Intelligence', 'Service', '--'], 
        ['S-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O']]
      ]

Note: The three lists within each data[i] (losses, tokens, tags) have equal length, for i in [0, 1, 2].

I want to create a JSON file as follows:

[{
  "sentence": "-DOCSTART- Summary of Consolidated Financial Data Port conditions from Lloyds Shipping Intelligence Service --",
  "annotations": [
    {
      "decision": "Consolidated Financial Data",
      "category": "ORG",
      "token_loss": [4.354, 4.366, 7.668],
      "totalloss": 4.354+4.366+7.668 # Here, I consider the sum of "token_loss"
    },
    {
      "decision": "Port",
      "category": "PER",
      "token_loss": [0.195],
      "totalloss": 0.195
    },
    {
      "decision": "Lloyds Shipping Intelligence Service",
      "category": "ORG",
      "token_loss": [3.561, 3.793, 6.741, 4.0],
      "totalloss": 3.561+3.793+6.741+4.0
    }]
}]

In the tag lists, a multi-word entity is always a sequence of "B-" (Begin), zero or more "I-" (Inside), and "E-" (End). A single-word entity is always tagged "S-" (Single). I ignore words tagged "O" (Outside).

This is my attempt so far:

startIdx = 0
endIdx = 10
decisions = []
for tag in tags:
    if tag.startswith('B'):
        start = tags.index(tag)
        startIdx = start
        while startIdx<10:
            if tags[startIdx+1].startswith('I'):
                decisions.append(tokens[startIdx:startIdx+1])
                startIdx += 1
            if tags[startIdx+1].startswith('E'):
                decisions.append(tokens[startIdx:startIdx+1])
                startIdx = 11

I tried but didn't find a complete answer for this. I didn't post my code because it isn't well structured yet.
– joe, Commented Jul 1, 2021 at 0:34

Ajax1234 · Accepted Answer · 2021-07-01 01:33:01Z

3

You can use a generator function to produce the groupings:

import json, collections
data = [[[0.025], ['-DOCSTART-'], ['O']], [[0.166, 0.001, 4.354, 4.366, 7.668], ['Summary', 'of', 'Consolidated', 'Financial', 'Data'], ['O', 'O', 'B-ORG', 'I-ORG', 'E-ORG']], [[0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05], ['Port', 'conditions', 'from', 'Lloyds', 'Shipping', 'Intelligence', 'Service', '--'], ['S-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG', 'O']]]
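def p_ranges(s):
    # Yield (token indices, category) for each "S-" tag and each "B-" ... "E-" span.
    r = None
    for i, a in enumerate(s):
        if a != 'O':
            if a.startswith('S'):
                yield ([i], a.split('-')[-1])
            elif a.startswith('B'):
                r = [i]  # open a new span
            elif a.startswith('E'):
                yield (r+[i], a.split('-')[-1])
                r = None  # close the span
            elif r:
                r.append(i)  # "I-" tag inside an open span

def get_pairings(d):
    # For each [losses, tokens, tags] triple, yield the sentence text (a str),
    # then one annotation dict per entity found in the tags.
    for a, b, c in d:
        yield ' '.join(b)
        for i, _c in p_ranges(c):
            # The walrus operator (:=) requires Python 3.8+.
            yield {"decision":' '.join(b[j] for j in i),
                   "category":_c,
                   "token_loss":(t:=[a[j] for j in i]),
                   "totalloss":sum(t)}

# Separate the yielded items by type: str -> sentence parts, dict -> annotations.
d = collections.defaultdict(list)
for i in get_pairings(data):
    d[type(i)].append(i)

result = [{'sentence':' '.join(d[str]), 'annotations':d[dict]}]
print(json.dumps(result, indent=4))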

Output:

[
    {
        "sentence": "-DOCSTART- Summary of Consolidated Financial Data Port conditions from Lloyds Shipping Intelligence Service --",
        "annotations": [
            {
                "decision": "Consolidated Financial Data",
                "category": "ORG",
                "token_loss": [
                    4.354,
                    4.366,
                    7.668
                ],
                "totalloss": 16.387999999999998
            },
            {
                "decision": "Port",
                "category": "PER",
                "token_loss": [
                    0.195
                ],
                "totalloss": 0.195
            },
            {
                "decision": "Lloyds Shipping Intelligence Service",
                "category": "ORG",
                "token_loss": [
                    3.561,
                    3.793,
                    6.741,
                    4.0
                ],
                "totalloss": 18.095
            }
        ]
    }
]
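The long "totalloss" value in the first annotation (16.387999999999998) is ordinary binary floating-point noise from sum(), not a grouping error. A minimal sketch of one way to tidy it, assuming three decimal places are enough because the sample losses have three; the helper name total_loss is hypothetical and not part of the code above:

```python
def total_loss(token_loss, ndigits=3):
    """Sum per-token losses and round away floating-point noise.

    ndigits=3 is an assumption based on the sample data, where every
    loss value has at most three decimal places.
    """
    return round(sum(token_loss), ndigits)


print(total_loss([4.354, 4.366, 7.668]))  # 16.388
print(total_loss([0.033, 0.539]))         # 0.572
```

Replacing sum(t) with such a helper in the annotation dict would keep "token_loss" untouched and only change how the total is reported.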

When running on your new sample:

data = [[[0.036, 0.937, 0.032, 2.985, 0.0, 0.044, 0.033, 0.539, 0.01, 0.009, 0.628, 0.706], ['At', 'Colchester', ':', 'Gloucestershire', '280', '(', 'J.', 'Russell', '63', ',', 'A.', 'Symonds'], ['O', 'S-LOC', 'O', 'S-ORG', 'O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'B-PER', 'E-PER']]]
d = collections.defaultdict(list)
for i in get_pairings(data):
   d[type(i)].append(i)

result = [{'sentence':' '.join(d[str]), 'annotations':d[dict]}]
print(json.dumps(result, indent=4))

Output:

[
    {
        "sentence": "At Colchester : Gloucestershire 280 ( J. Russell 63 , A. Symonds",
        "annotations": [
            {
                "decision": "Colchester",
                "category": "LOC",
                "token_loss": [
                    0.937
                ],
                "totalloss": 0.937
            },
            {
                "decision": "Gloucestershire",
                "category": "ORG",
                "token_loss": [
                    2.985
                ],
                "totalloss": 2.985
            },
            {
                "decision": "J. Russell",
                "category": "PER",
                "token_loss": [
                    0.033,
                    0.539
                ],
                "totalloss": 0.5720000000000001
            },
            {
                "decision": "A. Symonds",
                "category": "PER",
                "token_loss": [
                    0.628,
                    0.706
                ],
                "totalloss": 1.334
            }
        ]
    }
]
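The snippets above print the JSON, while the question asks for a JSON file. A minimal sketch of writing the structure to disk with json.dump, assuming a hypothetical file name annotations.json and a shortened stand-in for the result list:

```python
import json
import os
import tempfile

# Stand-in for the `result` list built above (shortened for illustration).
result = [{
    "sentence": "At Colchester",
    "annotations": [
        {"decision": "Colchester", "category": "LOC",
         "token_loss": [0.937], "totalloss": 0.937},
    ],
}]

# Hypothetical output path; a temporary directory keeps the sketch tidy.
path = os.path.join(tempfile.mkdtemp(), "annotations.json")

# json.dump writes straight to the open file object.
with open(path, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)

# Read the file back to confirm that it round-trips.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == result)  # True
```

In real use, path would simply be the target file name and result the list produced by the grouping code.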

edited Jul 1, 2021 at 1:33

answered Jul 1, 2021 at 0:30

Ajax1234


4 Comments

joe Over a year ago

Thanks a lot, @Ajax1234. Your code raises an exception for a case like:

[[0.036, 0.937, 0.032, 2.985, 0.0, 0.044, 0.033, 0.539, 0.01, 0.009, 0.628, 0.706], ['At', 'Colchester', ':', 'Gloucestershire', '280', '(', 'J.', 'Russell', '63', ',', 'A.', 'Symonds'], ['O', 'S-LOC', 'O', 'S-ORG', 'O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'B-PER', 'E-PER']]

Ajax1234 Over a year ago

@Joe In your original data, should not [18.44, [0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05]] really be [18.44, 0.195, 0.1, 0.0, 3.561, 3.793, 6.741, 4.0, 0.05]? In your most recent data, you don't have that nesting format for your float values even with S- type values present.

joe Over a year ago

Thank you. Your code still raises IndexError: list index out of range at "token_loss":(t:=[a[j] for j in i]) for cases like the one in my comment above.

Ajax1234 Over a year ago

@Joe Make sure your data is triple nested:

[[[0.036, 0.937, 0.032, 2.985, 0.0, 0.044, 0.033, 0.539, 0.01, 0.009, 0.628, 0.706], ['At', 'Colchester', ':', 'Gloucestershire', '280', '(', 'J.', 'Russell', '63', ',', 'A.', 'Symonds'], ['O', 'S-LOC', 'O', 'S-ORG', 'O', 'O', 'B-PER', 'E-PER', 'O', 'O', 'B-PER', 'E-PER']]]

I added the output I get when running your latest data on my solution in my most recent edit.
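The nesting requirement can also be handled in code. A minimal sketch, assuming a single sample is recognisable because its first element is a flat list of numbers; the helper name as_samples is hypothetical and not part of the answer:

```python
def as_samples(data):
    """Return data as a list of [losses, tokens, tags] samples.

    Hypothetical helper: if `data` is a single sample (its first element
    is a flat list of numbers), wrap it in a list; otherwise return it as is.
    """
    if data and data[0] and isinstance(data[0][0], (int, float)):
        return [data]
    return data


single = [[0.036, 0.937], ['At', 'Colchester'], ['O', 'S-LOC']]
print(len(as_samples(single)))    # 1
print(len(as_samples([single])))  # 1
```

With this in place, get_pairings(as_samples(data)) would accept either a single sample or a list of samples.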
