
I'm very new to Python and I'm sure there is a much easier way to accomplish what I need but here goes.

I'm trying to create a program which performs frequency analysis on a list of letters called inputList, retrieves the 2-letter pairs, and adds them to another dictionary. So I need it to populate a second dictionary with all the 2-letter pairs.

I have a rough idea how I can do this but I am a bit stuck with the syntax to make it work.

for bigram in inputList:
    bigramDict[str(bigram + bigram+1)] =  1

Where bigram+1 is the letter in the next iteration

As an example, if I were to have the text "stackoverflow" in inputList, I need it to first put the letters "st" as the key and 1 as the value, then on the second iteration "ta" as the key, and so on. The problem I'm having is retrieving the value the variable will have on the next iteration without moving to the next iteration.

I hope I explained myself clearly. Thanks for your help

4 Answers

5

A straightforward way to obtain n-grams for a sequence is slicing:

def ngrams(seq, n=2):
    return [seq[i:i+n] for i in range(len(seq) - n + 1)]

Combine this with collections.Counter and you're ready:

from collections import Counter
print(Counter(ngrams("abbabcbabbabr")))

In case you need ngrams() to be lazy:

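from collections import deque
from itertools import islice

def ngrams(it, n=2):
    it = iter(it)
    # Fill the window with the first n items only; passing `it` directly
    # to deque() would consume the whole iterator.
    deq = deque(islice(it, n), maxlen=n)
    if len(deq) < n:
        return  # input shorter than n: no n-grams
    yield tuple(deq)
    for p in it:
        # maxlen=n drops the oldest item as the new one is appended.
        deq.append(p)
        yield tuple(deq)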

(See below for more elegant code for the latter).


2 Comments

Is string subscripting in Python an O(1) operation or an O(n) operation? This is either incredibly elegant or incredibly slow...
... Well, it ran on 14 megabytes of input quickly enough. It must be O(1) and thus this must be elegant. :D
3

Use zip to pair the string with a copy of itself offset by 1.

Get bigrams like this:

s = "stackoverflow"
list(zip(s, s[1:]))

Gives:

[('s', 't'), ('t', 'a'), ('a', 'c'), ('c', 'k'), ('k', 'o'), ('o', 'v'), ('v', 'e'), ('e', 'r'), ('r', 'f'), ('f', 'l'), ('l', 'o'), ('o', 'w')]

Trigrams are also easy:

list(zip(s, s[1:], s[2:]))

Gives:

[('s', 't', 'a'), ('t', 'a', 'c'), ('a', 'c', 'k'), ('c', 'k', 'o'), ('k', 'o', 'v'), ('o', 'v', 'e'), ('v', 'e', 'r'), ('e', 'r', 'f'), ('r', 'f', 'l'), ('f', 'l', 'o'), ('l', 'o', 'w')]

You can use the tuples as the keys for your dictionary, or better still use collections.Counter or collections.defaultdict to do the counting. Good luck!
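
As a minimal sketch of that counting step (the name bigram_counts is illustrative, not from the question), joining each tuple back into a string gives keys like "st", which is the shape the question asks for:

```python
from collections import Counter

s = "stackoverflow"

# Pair each letter with its successor, then join each pair into a 2-letter key.
bigram_counts = Counter(a + b for a, b in zip(s, s[1:]))

print(bigram_counts["st"])  # 1
print(len(bigram_counts))   # 12, since every pair in this word is distinct
```

Counter is a dict subclass, so bigram_counts can be used anywhere a plain dictionary is expected.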


3
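from collections import Counter
from itertools import islice, tee

def pairs(iterable):
    # Two independent iterators over the input; the second is advanced by one.
    # zip is lazy on Python 3; on Python 2 use itertools.izip to stay lazy.
    a, b = tee(iterable)
    for pair in zip(a, islice(b, 1, None)):
        yield pair

print(Counter(pairs("stackoverflow")))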

Or a simpler version:

def pairs(iterable):
    it = iter(iterable)
    try:
        last = next(it)
    except StopIteration:
        return  # empty input: no pairs (avoids RuntimeError on Python 3.7+)
    for c in it:
        yield last, c
        last = c

A generalized version for arbitrary n:

def ngrams(iterable, n=2):
    return zip(*[islice(it, i, None) for i, it in enumerate(tee(iterable, n))])
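
The one-liner is dense, so here is a sketch that spells out the same steps one at a time. The name ngrams_verbose is made up for illustration, and the code assumes Python 3, where zip is lazy:

```python
from itertools import islice, tee

def ngrams_verbose(iterable, n=2):
    # tee() gives n independent iterators over the same input.
    copies = tee(iterable, n)
    # Skip i items on the i-th copy, so copy i starts at position i.
    shifted = [islice(it, i, None) for i, it in enumerate(copies)]
    # zip() stops at the shortest iterator, i.e. the one shifted furthest.
    return zip(*shifted)

print(list(ngrams_verbose("stack", 3)))
# [('s', 't', 'a'), ('t', 'a', 'c'), ('a', 'c', 'k')]
```

If the input has fewer than n items, the furthest-shifted iterator is empty and no n-grams are produced.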

2 Comments

Nice, but how about arbitrary n-grams? I have a strong feeling there must be an itertools oneliner for that.
@thg435: I've posted generalized version
1

Keep a variable holding the previous letter? On the first iteration you just fetch the first letter and do nothing else.

ADDENDUM: This method, at the very least, doesn't need to waste any more memory than a simple variable to store one letter, no excess tuples or anything.
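
A rough sketch of this idea, reusing the names inputList and bigramDict from the question. The sample input and the use of None as the "no previous letter yet" marker are assumptions for illustration:

```python
inputList = list("stackoverflow")  # stand-in for the asker's list of letters
bigramDict = {}

previous = None  # nothing has been seen before the first iteration
for letter in inputList:
    if previous is not None:
        key = previous + letter
        # Count the pair, starting from 0 the first time it is seen.
        bigramDict[key] = bigramDict.get(key, 0) + 1
    previous = letter

print(bigramDict["st"])  # 1
```

This counts repeated pairs instead of always assigning 1, which is what a frequency analysis would normally want.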
