
My code counts tokens. It takes input from text files, but I want to take input from Excel instead. How? Another problem is that the code does not work on relatively large data. How do I optimize it?

# start
import codecs
from collections import Counter

def count_char():
    # input file
    fi = codecs.open('G:\python-proj/text.txt', 'r', encoding='utf-8')
    Input = fi.read()

    # count words
    spli = Input.split()
    freq = Counter(spli)
    total = len(list(freq.elements()))
    print('Total tokens:\n')
    print(total)
    print('\n')

count_char()
  • Regarding reading Excel files, your question needs to be more targeted. Start at python-excel.org, pick the library that seems most appropriate, try to write some code, search Stack Overflow for answers, and then post any issue that you encounter as a question, along with your code showing what you have tried. See "How do I ask a good question" in the Help Center. I have answered the other part of your question below. Commented Feb 17, 2017 at 17:10

2 Answers


Input = fi.read() reads the whole file into memory; that's why large files are tripping you up. The solution is to read line by line.

Large files can still trip you up because you are saving the words in a Counter object. If there are few duplicates, that object will get very large; if duplicates are common, memory will not be an issue.

Whatever you do, don't call list(someCounter.elements()) when someCounter has a large total count: it returns one list element per count. (If someCounter = Counter({'redrum': 100000}), then list(someCounter.elements()) would give you a list with 100000 elements!)
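To illustrate, the totals you usually want can be read straight off the Counter without building that list:

```python
from collections import Counter

# One distinct token with a huge count.
counter = Counter({'redrum': 100000})

total_count = sum(counter.values())  # adds the stored counts directly
unique_count = len(counter)          # number of distinct keys

print(total_count)   # 100000
print(unique_count)  # 1
```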

char_count = 0
word_counter = Counter()
with codecs.open('G:\python-proj/text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        char_count += len(line)
        word_counter.update(line.split())
unique_word_count = len(word_counter)
total_word_count = sum(word_counter.values())  # values(), not Python 2's itervalues()

Do note that using line.split() may result in some words being counted as unique that you wouldn't consider to be unique. Consider:

>>> line = 'Red cars raced red cars.\n'
>>> Counter(line.split())
Counter({'cars': 1, 'cars.': 1, 'raced': 1, 'red': 1, 'Red': 1})

If we want 'red' and 'Red' to be counted together irrespective of capitalization we can do this:

>>> line = 'Red cars raced red cars.\n'
>>> Counter(line.lower().split()) # everything is made lowercase before counting
Counter({'red': 2, 'cars': 1, 'cars.': 1, 'raced': 1})

If we want 'cars' and 'cars.' to be counted together regardless of punctuation we strip the punctuation like so:

>>> import string
>>> punct = string.punctuation
>>> line = 'Red cars raced red cars.\n'
>>> Counter(word.strip(punct) for word in line.lower().split())
Counter({'cars': 2, 'red': 2, 'raced': 1})

Regarding reading Excel files your question needs to be more targeted. Start at python-excel.org, pick the library that seems most appropriate, try to write some code, search StackOverflow for answers, and then post any issue that you encounter as a question along with your code showing what you have tried.
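Once you have picked a library, a counting loop over a workbook might look like this minimal sketch using openpyxl (a third-party package, `pip install openpyxl`; the sample data written here stands in for your real spreadsheet):

```python
from collections import Counter
from openpyxl import Workbook, load_workbook

# Build a small example workbook so the sketch runs on its own.
wb = Workbook()
ws = wb.active
ws.append(['Red cars raced', 'red cars'])
ws.append(['red', None])
wb.save('example.xlsx')

# Count tokens cell by cell; read_only mode streams rows instead of
# loading the whole sheet into memory at once.
word_counter = Counter()
for sheet in load_workbook('example.xlsx', read_only=True).worksheets:
    for row in sheet.iter_rows(values_only=True):
        for cell in row:
            if isinstance(cell, str):        # skip numbers and empty cells
                word_counter.update(cell.lower().split())

print('Total tokens:', sum(word_counter.values()))   # Total tokens: 6
print('Unique tokens:', len(word_counter))           # Unique tokens: 3
```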


2 Comments

I used Counter because I didn't want to count duplicate tokens. I am a beginner in Python.
@kossaaar: Updated answer.

1) You could save the Excel file as a .csv and use Python's built-in csv module to parse it.

2) It's slow on large datasets because you're reading the whole file into memory at once with fi.read(). You could count the tokens on each line instead:

for line in fi:
    do_something(line.split())

Iterating over the file object yields one line at a time, so the whole file never needs to sit in memory. (Note that for line in fi.read(): would loop over individual characters, not lines.)
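Point 1 can be sketched with the standard-library csv module (sample.csv is a placeholder name; a tiny file is written first so the example runs on its own):

```python
import csv
from collections import Counter

# Write a tiny sample file so the sketch is self-contained.
with open('sample.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([['Red cars raced', 'red cars'], ['red', '']])

# Read it back row by row; the reader never loads the whole file at once.
word_counter = Counter()
with open('sample.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        for cell in row:
            word_counter.update(cell.lower().split())

print('Total tokens:', sum(word_counter.values()))   # Total tokens: 6
print('Unique tokens:', len(word_counter))           # Unique tokens: 3
```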

