Wednesday, June 22, 2011

Ngrams with coroutines in Python

This is how I define ngrams with coroutines:

import collections

def coroutine(func):
    """ A decorator function that takes care
    of starting a coroutine automatically on call """
    def start(*args, **kwargs):
        coro = func(*args, **kwargs)
        # Advance to the first yield so the coroutine can accept send();
        # next(coro) works on Python 2.6+ as well as Python 3
        next(coro)
        return coro
    return start

@coroutine
def ngrams(n, target):
    """ A coroutine to generate ngrams.
    Accepts one char at a time """
    chars = collections.deque()
    while True:
        chars.append((yield))
        if len(chars) == n:
            # Send a snapshot: the deque itself is mutated right after
            target.send(tuple(chars))
            chars.popleft()
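
To make the sliding window concrete, here is a minimal sketch that feeds a short string through `ngrams` and records what comes out. The `collect` sink and the `seen` list are hypothetical helpers added for illustration, and this version sends a tuple snapshot of the window so that stored ngrams are not changed by the later `popleft()`.

```python
import collections


def coroutine(func):
    """Start a coroutine automatically so it is ready for send()."""
    def start(*args, **kwargs):
        coro = func(*args, **kwargs)
        next(coro)  # advance to the first yield
        return coro
    return start


@coroutine
def ngrams(n, target):
    """Receive one char at a time and forward each window of n chars."""
    chars = collections.deque()
    while True:
        chars.append((yield))
        if len(chars) == n:
            target.send(tuple(chars))  # snapshot of the current window
            chars.popleft()            # slide the window by one char


@coroutine
def collect(results):
    """Hypothetical sink: append everything received to a list."""
    while True:
        results.append((yield))


seen = []
trigrams = ngrams(3, collect(seen))
for c in "hello":
    trigrams.send(c)

print(seen)
# [('h', 'e', 'l'), ('e', 'l', 'l'), ('l', 'l', 'o')]
```

Nothing is sent until the window holds `n` characters, so a five-character input yields three trigrams.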

I need to filter the text before generating ngrams, and I also want to process the ngrams (in this case, count bigrams):

@coroutine
def filter_chars(accepted_chars, target):
    """ A coroutine to filter out unaccepted chars.
    Accepts one char at a time """
    while True:
        c = (yield)
        # Compare and forward the lowercased char so counts ignore case
        if c.lower() in accepted_chars:
            target.send(c.lower())

@coroutine
def counter(matrix):
    """ A counter sink """
    while True:
        a, b = (yield)
        # pos is not shown in this excerpt: it maps each accepted char
        # to its row/column index in the matrix
        matrix[pos[a]][pos[b]] += 1
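
The two stages can be tried on their own. The sketch below assumes a tiny alphabet and builds `pos` as a plain dict from char to index; the alphabet, the `collect` sink, and the zero-filled matrix are illustrative assumptions, not values from the full source.

```python
def coroutine(func):
    """Start a coroutine automatically so it is ready for send()."""
    def start(*args, **kwargs):
        coro = func(*args, **kwargs)
        next(coro)
        return coro
    return start


# Assumed for illustration: a tiny alphabet and its index lookup table
accepted_chars = 'abc '
pos = dict((char, idx) for idx, char in enumerate(accepted_chars))


@coroutine
def filter_chars(accepted_chars, target):
    """Forward only accepted chars, lowercased."""
    while True:
        c = (yield)
        if c.lower() in accepted_chars:
            target.send(c.lower())


@coroutine
def counter(matrix):
    """Sink: count each received (a, b) pair in the matrix."""
    while True:
        a, b = (yield)
        matrix[pos[a]][pos[b]] += 1


@coroutine
def collect(results):
    """Hypothetical sink: append everything received to a list."""
    while True:
        results.append((yield))


# The filter drops punctuation and lowercases what it keeps
kept = []
letters = filter_chars(accepted_chars, collect(kept))
for c in "A-b, C!":
    letters.send(c)
print(kept)  # ['a', 'b', ' ', 'c']

# The counter increments one cell per pair it receives
k = len(accepted_chars)
matrix = [[0] * k for _ in range(k)]
pairs = counter(matrix)
pairs.send(('a', 'b'))
pairs.send(('a', 'b'))
pairs.send(('b', ' '))
print(matrix[pos['a']][pos['b']])  # 2
print(matrix[pos['b']][pos[' ']])  # 1
```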

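I combine my coroutines together:

# k, accepted_chars and enc are defined in the full source
# (k is the number of accepted chars, enc the encoding of the input file).
# Every cell starts at 10 instead of 0, which acts as a smoothing prior
# for pairs that never show up in the text.
counts = [[10 for _ in xrange(k)] for _ in xrange(k)]
bigrams = filter_chars(accepted_chars, ngrams(2, counter(counts)))
for c in open('big.txt').read().decode(enc):
    bigrams.send(c)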
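
For a quick end-to-end check without `big.txt`, here is a self-contained sketch of the same pipeline run on an in-memory string. The three-character alphabet, the sample text, and the zero-initialised matrix are assumptions chosen to keep the result readable; the real code starts from 10 and reads a file.

```python
import collections


def coroutine(func):
    """Start a coroutine automatically so it is ready for send()."""
    def start(*args, **kwargs):
        coro = func(*args, **kwargs)
        next(coro)
        return coro
    return start


@coroutine
def filter_chars(accepted_chars, target):
    """Forward only accepted chars, lowercased."""
    while True:
        c = (yield)
        if c.lower() in accepted_chars:
            target.send(c.lower())


@coroutine
def ngrams(n, target):
    """Forward each window of n consecutive chars."""
    chars = collections.deque()
    while True:
        chars.append((yield))
        if len(chars) == n:
            target.send(tuple(chars))
            chars.popleft()


# Assumed for illustration: a three-character alphabet
accepted_chars = 'ab '
pos = dict((char, idx) for idx, char in enumerate(accepted_chars))
k = len(accepted_chars)


@coroutine
def counter(matrix):
    """Sink: count each received (a, b) pair in the matrix."""
    while True:
        a, b = (yield)
        matrix[pos[a]][pos[b]] += 1


# Zeros instead of 10 so the raw counts are easy to read
counts = [[0] * k for _ in range(k)]
bigrams = filter_chars(accepted_chars, ngrams(2, counter(counts)))
for c in "Ab ba!":
    bigrams.send(c)

# Rows and columns follow the order of accepted_chars: 'a', 'b', ' '
print(counts)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The pipeline is built from the sink outwards, so each stage already has its target when it is created. On Python 3, text read with `open('big.txt', encoding=enc)` is already decoded, so the `.decode(enc)` step is not needed there.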

Full source can be found in my fork of rrenaud's gibberish detector.
