My Python 2 code (PSP):

input = form.getfirst('input')
row = cgi.escape(input)

f = open(filename, 'a')
f.write('"' + row + '",\n')
f.close()

generates this:

        "python - питон, прорицатель",
        "cobra - кобра, очковая змея",

I am trying to read this data with Python 3, whose default encoding is UTF-8.

"python - питон, прорицатель",

"cobra - кобра, очковая змея",

I need to count the Russian letters, but my script counts the characters that make up the character references instead: '&', '#', ';', and digits.
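
To illustrate why the counts come out this way, here is a minimal Python 3 sketch. It assumes the file was produced by something equivalent to the 'xmlcharrefreplace' error handler; the sample word and variable names are only for illustration.

```python
# Python 3 sketch: how Cyrillic text turns into numeric character references.
text = 'питон'

# The 'xmlcharrefreplace' error handler replaces every character that the
# target codec (ASCII here) cannot represent with "&#<code point>;".
encoded = text.encode('ascii', 'xmlcharrefreplace')
print(encoded)                    # b'&#1087;&#1080;&#1090;&#1086;&#1085;'

# Each reference is 7 ASCII characters, so a per-character count of the raw
# file sees '&', '#', digits and ';' instead of the 5 Cyrillic letters.
print(len(text), len(encoded))    # 5 35
```

The file on disk therefore contains only ASCII, and reading it as UTF-8 succeeds without ever producing a Cyrillic character.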

How can I decode the ASCII 'xmlcharrefreplace' output back to text, so that I can compare it with the Russian letters hard-coded in my Python 3 (UTF-8) script:

#!/usr/bin/python3

import sys

print(sys.getdefaultencoding())
print(sys.stdout.encoding)

ru_abc = set(['а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я'])
stat_data = {'other': {}, 'russian': {}}

with open(filename) as fileobj:
    for letter in fileobj.read():
        if letter in ru_abc:
            if letter in stat_data['russian']:
                stat_data['russian'][letter] += 1
            else:
                stat_data['russian'][letter] = 1
        else:
            if letter in stat_data['other']:
                stat_data['other'][letter] += 1
            else:
                stat_data['other'][letter] = 1
print(stat_data)

My stdout looks like:

 utf-8
 UTF-8
 {'russian': {}, 'other': {'~': 3, '=': 169, '<': 300, '?': 473, '>': 312, ';': 318392, ':': 222, '%': 29, "'": 31, '&': 318409, '!': 36, ' ': 51427, '#': 318390, '"': 320, '-': 9822, ',': 21578, '/': 843, '.': 800, ')': 527, '(': 526, '+': 2, ']': 8, '_': 117, '[': 8, '|': 1, '\r': 224, '\n': 224, '\t': 38, '`': 3, '5': 31451, '4': 23216, '7': 131141, '6': 40036, '1': 352560, '0': 373246, '3': 25196, '2': 37785, '9': 81825, '8': 177608, 'u': 3354, 't': 7281, 'w': 1179, 'v': 1074, 'q': 214, 'p': 2966, 's': 5816, 'r': 6948, 'y': 1714, 'x': 318, 'z': 222, 'e': 10841, 'd': 2918, 'g': 1996, 'f': 1801, 'a': 7069, 'c': 4020, 'b': 1805, 'm': 2337, 'l': 4821, 'o': 5906, 'n': 6307, 'i': 8068, 'h': 2559, 'k': 902, 'j': 142}}

1 Answer

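Use the unescape() method of the html.parser.HTMLParser() class (deprecated since Python 3.4 and removed in Python 3.9; on those versions the html.unescape() function does the same job):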

from html.parser import HTMLParser

parser = HTMLParser()

with open(filename) as fileobj:
    for line in fileobj:
        line = parser.unescape(line)

Demo:

>>> parser.unescape('        "python - &#1087;&#1080;&#1090;&#1086;&#1085;, &#1087;&#1088;&#1086;&#1088;&#1080;&#1094;&#1072;&#1090;&#1077;&#1083;&#1100;",')
'        "python - питон, прорицатель",'
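
For interpreters where the parser method is unavailable, the same demo can be sketched with the module-level html.unescape() function, which exists since Python 3.4. The sample strings here are shortened for illustration.

```python
import html

# Decimal character references, as found in the file from the question.
line = '"python - &#1087;&#1080;&#1090;&#1086;&#1085;"'
decoded = html.unescape(line)
print(decoded)                                 # "python - питон"

# Hexadecimal and named references are decoded as well.
print(html.unescape('&#x43F; &amp; &lt;'))     # п & <
```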

I'd use a collections.Counter() object to count the characters:

from collections import Counter
from html.parser import HTMLParser

ru_abc = set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя')
parser = HTMLParser()
stat_data = {'other': Counter(), 'russian': Counter()}


with open(filename) as fileobj:
    for line in fileobj:
        line = parser.unescape(line)
        stat_data['russian'].update(c for c in line if c in ru_abc)
        stat_data['other'].update(c for c in line if c not in ru_abc)

Result:

{
    'other': Counter({' ': 23, ',': 4, '"': 4, '\n': 3, 'o': 2, '-': 2, 'y': 1, 't': 1, 'b': 1, 'r': 1, 'p': 1, 'n': 1, 'h': 1, 'c': 1, 'a': 1}),
    'russian': Counter({'о': 5, 'а': 3, 'р': 3, 'п': 2, 'к': 2, 'и': 2, 'е': 2, 'я': 2, 'т': 2, 'н': 1, 'м': 1, 'л': 1, 'з': 1, 'в': 1, 'б': 1, 'ь': 1, 'ч': 1, 'ц': 1})
}
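
The same counting can be sketched as a single-pass function that takes any iterable of lines, which makes it easy to try without a file. The function name count_letters and the sample line are hypothetical, and folding uppercase Russian letters to lowercase is an assumption that the code above does not make.

```python
import html
from collections import Counter

RU_ABC = set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя')


def count_letters(lines):
    """Return {'russian': Counter, 'other': Counter} for an iterable of lines."""
    stats = {'russian': Counter(), 'other': Counter()}
    for line in lines:
        # Decode &#1087;-style references before looking at single characters.
        for char in html.unescape(line):
            lowered = char.lower()
            if lowered in RU_ABC:
                # Assumption: 'П' and 'п' are counted as the same letter.
                stats['russian'][lowered] += 1
            else:
                stats['other'][char] += 1
    return stats


# Hypothetical sample line; &#1055; is the uppercase letter 'П'.
sample = ['"python - &#1055;&#1080;&#1090;&#1086;&#1085;",\n']
stats = count_letters(sample)
print(stats['russian'])
```

To run it on the real data, pass an open file object, for example open(filename, encoding='utf-8'), since iterating over a file yields its lines.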