
I have written a compression program and tested it on 10 KB text files, which took no less than 3 minutes. However, I've tested it with a 1 MB file, which is the assessment assigned by my teacher, and it takes longer than half an hour. Compared to my classmates', mine takes unusually long. It might be my computer or my code, but I have no idea which. Does anyone know any tips or shortcuts for making my code faster? My compression code is below; if there are any quicker ways of doing the loops, etc., please send me an answer (:

(By the way, my code DOES work, so I'm not asking for corrections, just enhancements or tips. Thanks!)

import re #regular expressions, used to find word patterns in the text file
import os #used for file-system operations such as getting file sizes
from collections import Counter #used to count how many times each word appears

size1 = os.path.getsize('file.txt') #finds the size (in bytes) of your file, INCLUDING SPACES
print('The size of your file is ', size1)

words = re.findall(r'\w+', open('file.txt').read())
wordcounts = Counter(words) #counts every word, treating capitalised forms separately
common100 = [x for x, count in wordcounts.most_common(100)] #identifies the 100 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the array called keywords
    kcount.append(value)

characters =['$','#','@','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
         'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
         'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
         'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
         'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
         'wwv','xxw','yyx','zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
         'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
         'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
        'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
         'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
         'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
         'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
         'WWV','XXW','YYX','ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
         'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #pairs each common word with the next symbol
    char = char + 1

print("Compression has now started")

f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i)>1: #if the values in common100 are ACTUALLY in words
                place = words.index(i)#find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character with one common word
                words[place] = symbols # replaces the word with the symbol
                g = g + 1
    no = no + 1


string = words
stringMade = ' '.join(map(str, string)) #joins the list back into a string so it can be written to a text file
file = open("compression.txt", "w")
file.write(stringMade) #writes the compressed text into the new file
file.close()

size2 = os.path.getsize('compression.txt')

no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100 / no1) * no2), 'percent.')
print('The size of your file now is ', size2)
  • This question belongs to codereview.stackexchange.com Commented May 15, 2016 at 18:21
  • What is the while no < 100: loop for? Commented May 15, 2016 at 18:21
  • To make it do the loop 100 times, so that each time it replaces the x most common words @Tom Dalton Commented May 16, 2016 at 14:21
  • I don't think so - unless I'm going blind, the variable no is not used inside that loop at all. Commented May 16, 2016 at 14:58
  • You have for i in common100 that gives you the iteration over the actual 100 most-common words. Commented May 16, 2016 at 14:59

3 Answers


Using something like

word_substitutes = dict(zip(common100, characters))

will give you a dict that maps common words to their corresponding symbol.

Then you can simply iterate over the words:

# Iterate over all the words
# Use enumerate because we're going to modify the word in-place in the words list
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know its in the
    # 'common' words, and can be replaced by the symbol
    if word in word_substitutes:
        # Replaces the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol

This will give much better performance, because each dictionary lookup is a constant-time (amortised O(1)) hash-table operation rather than a linear scan. So the overall complexity will be roughly O(N) in the number of words, rather than the O(N^3) you get from the 2 nested loops with the .index() call inside.
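Tying it together, here is a minimal runnable sketch of this approach. The text and the symbol pool below are made up for illustration; in the question they come from file.txt and the characters list:

```python
import re
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox"
words = re.findall(r'\w+', text)

# Hypothetical symbol pool standing in for the question's `characters` list
characters = ['$', '#', '@', '!', '%']

# Map each common word (longer than one letter) to a symbol
common = [w for w, _ in Counter(words).most_common(len(characters)) if len(w) > 1]
word_substitutes = dict(zip(common, characters))

# One O(N) pass: each dict lookup is a constant-time hash probe
compressed = [word_substitutes.get(w, w) for w in words]
print(' '.join(compressed))
```

Note that `dict.get(w, w)` leaves uncommon words untouched, which also makes the outer `while no < 100:` loop unnecessary.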


2 Comments

Thank you! Actually you were right, I didn't need the while loop, so where before it took 45 minutes, now it only takes 5 minutes! Thank you very much @TomDalton
5 minutes, down from 45? I would have expected it to be a much bigger improvement! How big is your 'words' file?

The first thing I see that is bad for performance is:

for i in common100:
    for w in words:
        if i == w and len(i)>1:
            ...

What you are doing is seeing if the word w is in your list of common100 words. However, this check can be done in O(1) time by using a set and not looping through all of your top 100 words for each word.

common_words = set(common100)
for w in words:
    if w in common_words:
        ...
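To see the difference concretely, here is a small timeit micro-benchmark (the word list is made up) comparing membership tests on a 100-element list versus a set. The list test is deliberately the worst case, where the sought word is last:

```python
import timeit

common100 = ['word%d' % i for i in range(100)]
common_set = set(common100)

# 'word99' is the worst case for the list: a full scan on every check
list_time = timeit.timeit("'word99' in common100", globals=globals(), number=100000)
set_time = timeit.timeit("'word99' in common_set", globals=globals(), number=100000)
print('list: %.4fs  set: %.4fs' % (list_time, set_time))
```

The set lookup should win by a wide margin, since it hashes once instead of comparing against every element.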

1 Comment

Oh wait, I just tested it on 200 words and it compresses it only by 5%, and before it compressed by 20%, umm.... @gnicholas

Generally you would do the following:

  • Measure how much time each "part" of your program needs. You could use a profiler (e.g. this one in the standard library) or simply sprinkle some times.append(time.time()) into your code and compute differences. Then you know which part of your code is slow.
  • See if you can improve the algorithm of the slow part. gnicholas's answer shows one possibility to speed things up. The while no < 100 loop looks suspicious; maybe that can be improved. This step needs an understanding of the algorithms you use. Be careful to select the best data structures for your use case.
  • If you can't use a better algorithm (because you already calculate everything the best way), you need to speed up the computations themselves. Numerical code benefits from numpy, with cython you can basically compile Python code to C, and numba uses LLVM to compile.
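For the first point, a minimal sketch of profiling with the standard library's cProfile; the compress function and data here are placeholders, not the question's actual code:

```python
import cProfile
import io
import pstats

def compress(words, table):
    # Stand-in for the real compression pass
    return [table.get(w, w) for w in words]

words = ['alpha', 'beta', 'gamma'] * 1000
table = {'alpha': '$', 'beta': '#'}

profiler = cProfile.Profile()
profiler.enable()
result = compress(words, table)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

The report shows per-function call counts and times, which tells you directly where the half hour is being spent.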

1 Comment

Thank you! I have taken your advice into consideration. @syntonym
