Today a friend told me she's struggling with solving the EuroPython 2011 Problem D, called Twibet. I wrote a quick solution and I'm publishing it here for those who are curious about it.
The key to my solution was the recursive traversal (traverse_graph) of a directed graph. The representation of the graph is a dictionary of the monks as keys and a list of the following monks as values.
Because running on the large input file raised a runtime error of maximum recursion depth exceeded, I used a this little hack: sys.setrecursionlimit(10000). UPDATE: Here are the solutions of the contestants.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We're going to demonstrate parallel matrix multiplication in Python.
Lets suppose we are performing the multiplication: P = A * B. We're going to calculate the result matrix row-by-row. The following function returns a row of a result matrix P. The arguments are a row of matrix A and the matrix B.
We're going to use the multiprocessing module of Python to create a process pool containing as many processes as the number of CPUs we have and calculate each result row in a different process. Here are the two lines that does this trick:
Some explanation why is this code a bit nasty. Functions passed to the pool has to be picklable and because only the functions defined at the top level of a module are picklable, we needed to get the the row calculation function calc_row_of_product_matrix out of the class. Furthermore, to be able to pass two arguments to calc_row_of_product_matrix we also needed a helper function, which takes a tuple of a function and args, evaluates and returns the result:
Note that passing around the whole matrix B to all processes calculating result rows, or more precisely passing itertools.repeat(b) to pool.map() visibly increases the memory consumption of the multiprocess version. This was not a real issue for me, as the bottleneck was CPU; anyway, this issue could be addressed by using the shared memory module of multiprocessing. For now we'll leave that as an exercise to the reader.
Here are the running times on my Intel Core2 Duo 3.16GHz box. For sufficiently large matrix (above 500*500) the running time of the multiprocess version is nearly half of the single process version.
100*100 matrix, single process: 0.0835670982343 100*100 matrix, multiprocess: 0.351096555199
200*200 matrix, single process: 0.79961114284 200*200 matrix, multiprocess: 0.700980680681
500*500 matrix, single process: 14.4003259234 500*500 matrix, multiprocess: 7.99582457187 1000*1000 matrix, single process: 118.078526896 1000*1000 matrix, multiprocess: 66.8809939919
Here you can see the single process version running with CPU usage of 50%:
Here you can see the multiprocessing version running with CPU usage of 100%:
Here's the full code:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
I need to filter text before generating ngrams and also, I want to process ngrams (in this case count bigrams)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
On the other side, the most upvoted Perl question at this moment is about Unicode oddities. And the most upvoted answer will scare the shit out of you.
Recently I've been asked to send a few links to absolute beginners who want to start programming. As a language choice, I always recommend to start with Python. So I ended up to collect a list useful links for Python beginners:
Recently Daniel Varga and me have been working hard in our free-time on the Hunglish project and reached a milestone: the whole system around the sentence aligner have been streamlined and/or rewritten. Also it has been deployed onto a test machine and now can be tried by anyone. This new system is going to replace the old one soon, hence any feedback are highly appreciated.
New features
Help us build a bigger corpus
You can upload a pair documents (one in English, the other one in Hungarian). The uploaded documents will appear in the search results in a few minutes.
The original sentence database contained 2 million sentence pairs, which has been nearly doubled.
Planned features
Upvote / downvote
This is only implemented in the back-end, but we plan to add it to the user-interface.
Other language pairs
We also plan to add other language pairs. You can make a suggestion for a new language pair, bonus points if you recommend a decent open source stemmer (preferably implemented in Java) for new languages.