Transform a code tokens list into valid string code

Question

I have written code to transform Python code into a list to compute BLEU score:

import re

def tokenize_for_bleu_eval(code):
    code = re.sub(r'([^A-Za-z0-9_])', r' \1 ', code)
    code = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
    code = re.sub(r'\s+', ' ', code)
    code = code.replace('"', '`')
    code = code.replace('\'', '`')
    tokens = [t for t in code.split(' ') if t]

    return tokens

Thanks to this snippet my code struct.unpack('h', pS[0:2]) is parsed properly into the list ['struct', '.', 'unpack', '(', 'h', ',', 'p', 'S', '[', '0', ':', '2', ']', ')'].

Initially, I thought I need simply to use the ' '.join(list_of_tokens) but it kills my variable names like this struct . unpack ('h' , p S [ 0 : 2 ] ) and my code is not executable.

I tried to use Regex to stick some variable names but I can't succeed to reverse my function tokenize_for_bleu_eval to find executable code at the end. Is someone get an idea, perhaps without regex which seems to be too complicated here?

EDIT: We can't just remove all spaces between element of the list because there are examples like items = [item for item in container if item.attribute == value] where the result of the backtranslation without space would be itemforiteminaifitem[0]==1 which is not valid.

Can you give an example of a correct output for your sample list? — Tzane
– Tzane, Commented Jul 2, 2021 at 7:13
I am not sure what you mean. The intended output is the initial code, i.e. struct.unpack('h', pS[0:2]) — NathanaëlBeau
– NathanaëlBeau, Commented Jul 2, 2021 at 9:20
Regex looks like a very inappropriate tool here. Did you consider a Python code parser approach? I have not tried that, but you might find some tips here. — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Jul 2, 2021 at 9:59

Rizquuula · Accepted Answer · 2021-07-02 10:17:39Z

I am trying to merge the tokens using this script

import re

def tokenize_for_bleu_eval(code):
    code = re.sub(r'([^A-Za-z0-9_])', r' \1 ', code)
    code = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
    code = re.sub(r'\s+', ' ', code)
    code = code.replace('"', '`')
    code = code.replace('\'', '`')
    tokens = [t for t in code.split(' ') if t]

    return tokens

def merge_tokens(tokens):
    code = ''.join(tokens)
    code = code.replace('`', "'")
    code = code.replace(',', ", ")

    return code

tokenize = tokenize_for_bleu_eval("struct.unpack('h', pS[0:2])")
print(tokenize)  # ['struct', '.', 'unpack', '(', '`', 'h', '`', ',', 'p', 'S', '[', '0', ':', '2', ']', ')']
merge_result = merge_tokens(tokenize)
print(merge_result)  # struct.unpack('h', pS[0:2])

Edit:

I found this interesting idea to tokenize and merge.

import re

def tokenize_for_bleu_eval(code):
    tokens_list = []
    codes = code.split(' ')
    for i in range(len(codes)):
        code = codes[i]
        code = re.sub(r'([^A-Za-z0-9_])', r' \1 ', code)
        code = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
        code = re.sub(r'\s+', ' ', code)
        code = code.replace('"', '`')
        code = code.replace('\'', '`')
        tokens = [t for t in code.split(' ') if t]
        tokens_list.append(tokens)
        if i != len(codes) -1:
            tokens_list.append([' '])
    
    flatten_list = []

    for tokens in tokens_list:
        for token in tokens:
            flatten_list.append(token)

    return flatten_list

def merge_tokens(flatten_list):
    code = ''.join(flatten_list)
    code = code.replace('`', "'")

    return code

test1 ="struct.unpack('h', pS[0:2])"
test2 = "items = [item for item in container if item.attribute == value]"
tokenize = tokenize_for_bleu_eval(test1)
print(tokenize)  # ['struct', '.', 'unpack', '(', '`', 'h', '`', ',', ' ', 'p', 'S', '[', '0', ':', '2', ']', ')']
merge_result = merge_tokens(tokenize)
print(merge_result)  # struct.unpack('h', pS[0:2])

tokenize = tokenize_for_bleu_eval(test2)
print(tokenize)  # ['items', ' ', '=', ' ', '[', 'item', ' ', 'for', ' ', 'item', ' ', 'in', ' ', 'container', ' ', 'if', ' ', 'item', '.', 'attribute', ' ', '=', '=', ' ', 'value', ']']
merge_result = merge_tokens(tokenize)
print(merge_result)  # items = [item for item in container if item.attribute == value]

This script will also remember each space from the input

I'm sorry, I have updated my answer to show why it's not working, my example was not mimimal

Collectives™ on Stack Overflow

Transform a code tokens list into valid string code

1 Answer 1

Edit:

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Edit:

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related