
I have written a compression program and tested it on 10 KB text files, which took no less than 3 minutes. However, I've tested it with a 1 MB file, which is the assessment assigned by my teacher, and it takes longer than half an hour. Compared to my classmates', mine takes unusually long. It might be my computer or my code, but I have no idea which. Does anyone know any tips or shortcuts for making my code faster? My compression code is below; if there are any quicker ways of doing the loops, etc., please send me an answer (:

(By the way, my code DOES work, so I'm not asking for corrections, just enhancements or tips. Thanks!)

import re #regular expressions, used to find word patterns in the text file
import os #used for file-system operations such as getting file sizes
from collections import Counter #used to count how many times each word appears

size1 = os.path.getsize('file.txt') #finds the size (in bytes) of your file, INCLUDING SPACES
print('The size of your file is ', size1)

words = re.findall(r'\w+', open('file.txt').read())
wordcounts = Counter(words) #counts every word, treating capitalised forms separately
common100 = [x for x, count in wordcounts.most_common(100)] #identifies the 100 most common words

keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the array called keywords
    kcount.append(value)

characters =['$','#','@','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
         'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
         'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
         'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
         'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
         'wwv','xxw','yyx','zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
         'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
         'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
        'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
         'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
         'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
         'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
         'WWV','XXW','YYX','ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
         'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use

symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #pairs each common word with the next symbol
    char = char + 1

print("Compression has now started")

f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i)>1: #if the values in common100 are ACTUALLY in words
                place = words.index(i)#find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character with one common word
                words[place] = symbols # replaces the word with the symbol
                g = g + 1
    no = no + 1


string = words
stringMade = ' '.join(map(str, string)) #joins the list back into a string so it can be written to a text file
file = open("compression.txt", "w")
file.write(stringMade) #writes the compressed text into the new file
file.close()

size2 = os.path.getsize('compression.txt')

no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100 / no1) * no2), 'percent.')
print('The size of your file now is ', size2)
  • This question belongs to codereview.stackexchange.com Commented May 15, 2016 at 18:21
  • What is the while no < 100: loop for? Commented May 15, 2016 at 18:21
  • To make it do the loop 100 times, so that each time it replaces the x most common words @Tom Dalton Commented May 16, 2016 at 14:21
  • I don't think so - unless I'm going blind, the variable no is not used inside that loop at all. Commented May 16, 2016 at 14:58
  • You have for i in common100 that gives you the iteration over the actual 100 most-common words. Commented May 16, 2016 at 14:59

3 Answers


Using something like

word_substitutes = dict(zip(common100, characters))

will give you a dict that maps common words to their corresponding symbol.

Then you can simply iterate over the words:

# Iterate over all the words
# Use enumerate because we're going to modify the word in-place in the words list
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know its in the
    # 'common' words, and can be replaced by the symbol
    if word in word_substitutes:
        # Replaces the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol

This will give much better performance, because each dictionary lookup is a constant-time (amortised O(1)) hash-table operation rather than a linear scan. So the overall complexity will be roughly O(N) in the number of words, rather than the O(N^3) you get from the 2 nested loops with the .index() call inside.
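Tying it together, here is a minimal runnable sketch of this approach. The text and the symbol pool below are made up for illustration; in the question they come from file.txt and the characters list:

```python
import re
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox"
words = re.findall(r'\w+', text)

# Hypothetical symbol pool standing in for the question's `characters` list
characters = ['$', '#', '@', '!', '%']

# Map each common word (longer than one letter) to a symbol
common = [w for w, _ in Counter(words).most_common(len(characters)) if len(w) > 1]
word_substitutes = dict(zip(common, characters))

# One O(N) pass: each dict lookup is a constant-time hash probe
compressed = [word_substitutes.get(w, w) for w in words]
print(' '.join(compressed))
```

Note that `dict.get(w, w)` leaves uncommon words untouched, which also makes the outer `while no < 100:` loop unnecessary.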


2 Comments

Thank you! Actually you were right, I didn't need the while loop, so where before it took 45 minutes, now it only takes 5 minutes! Thank you very much @TomDalton
5 minutes, down from 45? I would have expected it to be a much bigger improvement! How big is your 'words' file?

The first thing I see that is bad for performance is:

for i in common100:
    for w in words:
        if i == w and len(i)>1:
            ...

What you are doing is seeing if the word w is in your list of common100 words. However, this check can be done in O(1) time by using a set and not looping through all of your top 100 words for each word.

common_words = set(common100)
for w in words:
    if w in common_words:
        ...
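To see the difference concretely, here is a small timeit micro-benchmark (the word list is made up) comparing membership tests on a 100-element list versus a set. The list test is deliberately the worst case, where the sought word is last:

```python
import timeit

common100 = ['word%d' % i for i in range(100)]
common_set = set(common100)

# 'word99' is the worst case for the list: a full scan on every check
list_time = timeit.timeit("'word99' in common100", globals=globals(), number=100000)
set_time = timeit.timeit("'word99' in common_set", globals=globals(), number=100000)
print('list: %.4fs  set: %.4fs' % (list_time, set_time))
```

The set lookup should win by a wide margin, since it hashes once instead of comparing against every element.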

1 Comment

Oh wait, I just tested it on 200 words and it compresses it only by 5%, and before it compressed by 20%, umm.... @gnicholas

Generally you would do the following:

  • Measure how much time each "part" of your program needs. You could use a profiler (e.g. this one in the standard library) or simply sprinkle some times.append(time.time()) into your code and compute differences. Then you know which part of your code is slow.
  • See if you can improve the algorithm of the slow part. gnicholas's answer shows one possibility to speed things up. The while no < 100 loop looks suspicious; maybe that can be improved. This step needs an understanding of the algorithms you use. Be careful to select the best data structures for your use case.
  • If you can't use a better algorithm (because you already calculate everything the best way), you need to speed up the computations themselves. Numerical code benefits from numpy, with cython you can basically compile Python code to C, and numba uses LLVM to compile.
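For the first point, a minimal sketch of profiling with the standard library's cProfile; the compress function and data here are placeholders, not the question's actual code:

```python
import cProfile
import io
import pstats

def compress(words, table):
    # Stand-in for the real compression pass
    return [table.get(w, w) for w in words]

words = ['alpha', 'beta', 'gamma'] * 1000
table = {'alpha': '$', 'beta': '#'}

profiler = cProfile.Profile()
profiler.enable()
result = compress(words, table)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

The report shows per-function call counts and times, which tells you directly where the half hour is being spent.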

1 Comment

Thank you! I have taken your advice into consideration. @syntonym
