
My code counts tokens. It takes input from text files, but I want to take input from Excel instead. How? Another problem is that the code does not work on relatively large data. How do I optimize it?

# start
import codecs
from collections import Counter

def count_char():
    # input file
    fi = codecs.open('G:\python-proj/text.txt', 'r', encoding='utf-8')
    Input = fi.read()

    # count words
    spli = Input.split()
    freq = Counter(spli)
    total = len(list(freq.elements()))
    print('Total tokens:\n')
    print(total)
    print('\n')

count_char()
  • Regarding reading Excel files, your question needs to be more targeted. Start at python-excel.org, pick the library that seems most appropriate, try to write some code, search Stack Overflow for answers, and then post any issue that you encounter as a question, along with your code showing what you have tried. See "How do I ask a good question" in the Help Center. I have answered the other part of your question below. Commented Feb 17, 2017 at 17:10

2 Answers


Input = fi.read() reads the whole file into memory; that's why large files are tripping you up. The solution is to read line by line.

Large files can still trip you up because you are saving the words in a Counter object. If there are few duplicates, that object will get very large; if duplicates are common, memory will not be an issue.

Whatever you do, don't call list(someCounter.elements()) when someCounter has a large total count: it returns one list element per count. (If someCounter = Counter({'redrum': 100000}), then list(someCounter.elements()) would give you a list with 100000 elements!)
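To illustrate, the totals you usually want can be read straight off the Counter without building that list:

```python
from collections import Counter

# One distinct token with a huge count.
counter = Counter({'redrum': 100000})

total_count = sum(counter.values())  # adds the stored counts directly
unique_count = len(counter)          # number of distinct keys

print(total_count)   # 100000
print(unique_count)  # 1
```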

char_count = 0
word_counter = Counter()
with codecs.open('G:\python-proj/text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        char_count += len(line)
        word_counter.update(line.split())
unique_word_count = len(word_counter)
total_word_count = sum(word_counter.values())  # values(), not Python 2's itervalues()

Do note that using line.split() may result in some words being counted as unique that you wouldn't consider to be unique. Consider:

>>> line = 'Red cars raced red cars.\n'
>>> Counter(line.split())
Counter({'cars': 1, 'cars.': 1, 'raced': 1, 'red': 1, 'Red': 1})

If we want 'red' and 'Red' to be counted together irrespective of capitalization we can do this:

>>> line = 'Red cars raced red cars.\n'
>>> Counter(line.lower().split()) # everything is made lowercase before counting
Counter({'red': 2, 'cars': 1, 'cars.': 1, 'raced': 1})

If we want 'cars' and 'cars.' to be counted together regardless of punctuation we strip the punctuation like so:

>>> import string
>>> punct = string.punctuation
>>> line = 'Red cars raced red cars.\n'
>>> Counter(word.strip(punct) for word in line.lower().split())
Counter({'cars': 2, 'red': 2, 'raced': 1})

Regarding reading Excel files your question needs to be more targeted. Start at python-excel.org, pick the library that seems most appropriate, try to write some code, search StackOverflow for answers, and then post any issue that you encounter as a question along with your code showing what you have tried.
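Once you have picked a library, a counting loop over a workbook might look like this minimal sketch using openpyxl (a third-party package, `pip install openpyxl`; the sample data written here stands in for your real spreadsheet):

```python
from collections import Counter
from openpyxl import Workbook, load_workbook

# Build a small example workbook so the sketch runs on its own.
wb = Workbook()
ws = wb.active
ws.append(['Red cars raced', 'red cars'])
ws.append(['red', None])
wb.save('example.xlsx')

# Count tokens cell by cell; read_only mode streams rows instead of
# loading the whole sheet into memory at once.
word_counter = Counter()
for sheet in load_workbook('example.xlsx', read_only=True).worksheets:
    for row in sheet.iter_rows(values_only=True):
        for cell in row:
            if isinstance(cell, str):        # skip numbers and empty cells
                word_counter.update(cell.lower().split())

print('Total tokens:', sum(word_counter.values()))   # Total tokens: 6
print('Unique tokens:', len(word_counter))           # Unique tokens: 3
```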


2 Comments

I used Counter because I didn't want to count duplicate tokens. I am a beginner in Python.
@kossaaar: Updated answer.

1) You could save the Excel file as a .csv and use Python's built-in csv module to parse it.

2) It's slow on large datasets because you're reading the whole file into memory at once with fi.read(). You could count the tokens on each line instead:

for line in fi:
    do_something(line.split())

Iterating over the file object yields one line at a time, so the whole file never needs to sit in memory. (Note that for line in fi.read(): would loop over individual characters, not lines.)
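Point 1 can be sketched with the standard-library csv module (sample.csv is a placeholder name; a tiny file is written first so the example runs on its own):

```python
import csv
from collections import Counter

# Write a tiny sample file so the sketch is self-contained.
with open('sample.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([['Red cars raced', 'red cars'], ['red', '']])

# Read it back row by row; the reader never loads the whole file at once.
word_counter = Counter()
with open('sample.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        for cell in row:
            word_counter.update(cell.lower().split())

print('Total tokens:', sum(word_counter.values()))   # Total tokens: 6
print('Unique tokens:', len(word_counter))           # Unique tokens: 3
```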

