Thursday, October 4, 2012

Use Python to extract date of birth from Wikipedia pages

Religion cybernetics and getting things done

Two days ago we had a conversation about astrology, and a friend mentioned it would be interesting, from a "religion cybernetics" aspect, to see the list of physicists whose zodiac sign is Taurus. Religion cybernetics: I had heard this term many times and was still puzzled about what the heck it meant, but this time I had a second thought: how long would it take to actually get this done and produce such a list from Wikipedia?

Well, here you go. I eventually did this during my daily telecommuting and in my spare minutes yesterday and today. All in all, it took about 2-3 hours, which is not bad.

The export format

I decided at the outset to start from the List of physicists, visit each individual page and extract the date of birth. I knew that Wikipedia can be downloaded, but I was too lazy to download the whole thing. On the other hand, I also knew that it is not nice to crawl the human-readable HTML version of the site, so I found a middle ground: the export format, which is basically the wiki content wrapped in some minimal XML metadata.
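As an illustration, here is a minimal sketch of working with the export format. It assumes the Special:Export URL scheme of the English Wikipedia; the function names and the sample document are made up for this example, and the sample is much shorter than a real export.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

EXPORT_BASE = "https://en.wikipedia.org/wiki/Special:Export/"


def export_url(title):
    """Build the Special:Export URL for one article title."""
    return EXPORT_BASE + quote(title.replace(" ", "_"))


def wikitext_from_export(xml_string):
    """Return the wiki markup of the first page in an export document."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        # Tags are namespaced, e.g. "{http://...export-0.10/}text".
        # Compare only the local name so the schema version does not matter.
        if element.tag.rsplit("}", 1)[-1] == "text":
            return element.text or ""
    return None


# A hand-written, heavily shortened stand-in for a real export document.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Johann Jakob Balmer</title>
    <revision>
      <text>'''Johann Jakob Balmer''' (May 1, 1825 – March 12, 1898)</text>
    </revision>
  </page>
</mediawiki>"""

url = export_url("Johann Jakob Balmer")
wikitext = wikitext_from_export(SAMPLE)
```

The actual download (for example with urllib.request) is left out on purpose, so the sketch runs without network access.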

The date of birth

After a short investigation, it turned out that the date of birth can usually be found in three different places.

1) Somewhere in the article text.
E.g. Johann Jakob Balmer's page starts with this: "Johann Jakob Balmer (May 1, 1825 – March 12, 1898)"

2) In the so-called infobox, which is on the right side of the page.
It can contain the dates in different formats; compare birth_date and death_date, for example.
"{{Infobox scientist 
|name = Johann Jakob Balmer 
|image = Balmer.jpeg 
|image_size = 220px 
|caption = 
|birth_date = May 1, 1825 
|birth_place = [[Lausen]], [[Switzerland]] 
|death_date = {{dda|1898|3|12|1825|5|1}} 
|death_place = [[Basel]], [[Switzerland]] 
|residence = 
|citizenship = 
|nationality = [[Switzerland]] 
|ethnicity = 
|field = [[Mathematics]] 
|work_institutions = 
|alma_mater = [[University of Basel]] 
|doctoral_advisor = 
|doctoral_students = 
|known_for = 
|author_abbrev_bot = 
|author_abbrev_zoo = 
|influences = 
|influenced = 
|prizes = 
|religion = 
|footnotes = 
|signature = }}"


3) Persondata
I have learnt that every biographical article has some metadata that is not visible on the human-readable page. Its purpose is exactly to help my work, that is, to help the automatic extraction of information. What I did not expect is that it can contain the dates in different formats as well.
"{{Persondata  
| NAME = Balmer, Johann 
| ALTERNATIVE NAMES = 
| SHORT DESCRIPTION = 
| DATE OF BIRTH = May 1, 1825 
| PLACE OF BIRTH = [[Lausen]], [[Switzerland]] 
| DATE OF DEATH = March 12, 1898 
| PLACE OF DEATH = [[Basel]] 
}}"
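The three places can be tried in turn. Below is a minimal sketch, not the original script: the function names are hypothetical, only two date formats are recognised ("May 1, 1825" and a {{birth date|year|month|day}} style template), and the order of the attempts (Persondata, then infobox, then article text) is an assumption.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUMBER = {name: i + 1 for i, name in enumerate(MONTHS)}

# "May 1, 1825": month name, day, four-digit year
PLAIN_DATE = re.compile(r"(%s) (\d{1,2}), (\d{4})" % "|".join(MONTHS))
# "{{birth date|1825|5|1}}" and similar year|month|day templates
TEMPLATE_DATE = re.compile(
    r"\{\{\s*birth date[^|}]*\|(\d{4})\|(\d{1,2})\|(\d{1,2})", re.I)


def parse_date(value):
    """Parse one value; return a date, or None if the format is unknown."""
    match = TEMPLATE_DATE.search(value)
    if match:
        year, month, day = map(int, match.groups())
        return date(year, month, day)
    match = PLAIN_DATE.search(value)
    if match:
        month = MONTH_NUMBER[match.group(1)]
        return date(int(match.group(3)), month, int(match.group(2)))
    return None


def field_value(wikitext, field):
    """Return the text after '| field =' on the same line, or None."""
    pattern = r"^\s*\|\s*%s\s*=[ \t]*(.*)$" % re.escape(field)
    match = re.search(pattern, wikitext, re.M | re.I)
    return match.group(1).strip() if match else None


def birth_date(wikitext):
    """Try Persondata, then the infobox, then the first date in the text."""
    for field in ("DATE OF BIRTH", "birth_date"):
        value = field_value(wikitext, field)
        if value:
            parsed = parse_date(value)
            if parsed:
                return parsed
    return parse_date(wikitext)


persondata = "{{Persondata\n| NAME = Balmer, Johann\n| DATE OF BIRTH = May 1, 1825\n}}"
infobox = "{{Infobox scientist\n|caption = \n|birth_date = May 1, 1825\n}}"
lead = "'''Johann Jakob Balmer''' (May 1, 1825 – March 12, 1898)"
```

Anything the two patterns do not cover comes back as None, which makes the unparsed pages easy to list and handle by hand.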

The cache

Since I knew I would run the whole process several times until I got everything right, I decided to save every web page I downloaded, so that I would not have to download them again unnecessarily. I also saved the dictionaries I built into files, so that later I could load them and rerun one step of the process without running all the preceding steps.
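A minimal sketch of such a cache follows. The function names, the file naming scheme and the use of pickle for the dictionaries are assumptions made for this example; the download itself is passed in as a function, so the sketch can be run with a stand-in.

```python
import os
import pickle
import tempfile  # only needed for the usage example below
from urllib.parse import quote


def cached_fetch(title, fetch, cache_dir):
    """Return the page for title, calling fetch(title) only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Percent-encode the title so it is always a valid file name.
    path = os.path.join(cache_dir, quote(title, safe="") + ".xml")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    text = fetch(title)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text


def save_step(result, path):
    """Store the result of one processing step (e.g. a dictionary)."""
    with open(path, "wb") as handle:
        pickle.dump(result, handle)


def load_step(path):
    """Load a previously stored step result."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Usage with a stand-in download function that records its calls.
calls = []


def fake_fetch(title):
    calls.append(title)
    return "<page>%s</page>" % title


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_fetch("Max Planck", fake_fetch, cache_dir)
    second = cached_fetch("Max Planck", fake_fetch, cache_dir)  # from disk
    step_file = os.path.join(cache_dir, "birthdays.pickle")
    save_step({"Max Planck": "April 23, 1858"}, step_file)
    reloaded = load_step(step_file)
```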

Epilog

Just after I finished writing this up, I made one last Google search. It's a shame I only read this Stack Overflow thread after I had finished; I should have started with that. There are several tools that I could have used:
- Pywikipediabot: a collection of Python scripts automating work on Wikipedia articles
- How to extract Persondata from the SQL dump files
- Wikidata
- This is the worst: forget the export format. There is an API which lets you retrieve data in JSON format. Next time start here: http://www.mediawiki.org/wiki/API/Tutorial
- There is a wiki format parser module that will return the date of birth
I could rewrite this in a few lines of code. But I won't do this now, because it is DONE. Next time I grab info from Wikipedia, I will be smarter.
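For the API route, a minimal sketch could look like the following. The query parameters (action=query, prop=revisions, rvprop=content, format=json) are standard MediaWiki API parameters; the response layout handled here, with the wikitext under the "*" key, is the legacy one and is an assumption, since newer API options return a different structure. The function names and the sample response are made up.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def api_url(title):
    """Build a query URL asking for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "titles": title,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def wikitext_from_response(body):
    """Pull the wikitext out of a legacy-format JSON response, or None."""
    pages = json.loads(body)["query"]["pages"]
    for page in pages.values():
        for revision in page.get("revisions", []):
            return revision.get("*")
    return None


# A hand-written, shortened stand-in for a real response body.
SAMPLE_BODY = json.dumps({
    "query": {"pages": {"123": {
        "title": "Johann Jakob Balmer",
        "revisions": [{"*": "{{Persondata\n| DATE OF BIRTH = May 1, 1825\n}}"}],
    }}}
})

url = api_url("Johann Jakob Balmer")
text = wikitext_from_response(SAMPLE_BODY)
```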

The code:
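For the final step, grouping names by sign, a minimal sketch could look like this. It is not the original script: the function names are hypothetical and the cut-off dates are assumed (the common tropical ones, which themselves vary by a day between sources). The list below files Balmer, born May 1, under Ari, so the script that produced it evidently used different boundaries.

```python
from collections import defaultdict
from datetime import date

# (month, day) on which each sign ends, in calendar order.
# Assumed boundaries; adjust them to whatever table you trust.
SIGN_ENDS = [
    ((1, 19), "Cap"), ((2, 18), "Aqu"), ((3, 20), "Pis"), ((4, 19), "Ari"),
    ((5, 20), "Tau"), ((6, 20), "Gem"), ((7, 22), "Can"), ((8, 22), "Leo"),
    ((9, 22), "Vir"), ((10, 22), "Lib"), ((11, 21), "Scorp"), ((12, 21), "Sag"),
]


def zodiac(birthday):
    """Return the abbreviated sign for a datetime.date."""
    key = (birthday.month, birthday.day)
    for end, sign in SIGN_ENDS:
        if key <= end:
            return sign
    return "Cap"  # December 22-31 wraps around to Capricorn


def group_by_zodiac(birthdays):
    """Turn {name: date} into {sign: [names]}."""
    groups = defaultdict(list)
    for name, birthday in birthdays.items():
        groups[zodiac(birthday)].append(name)
    return dict(groups)


groups = group_by_zodiac({"Johann Jakob Balmer": date(1825, 5, 1)})
```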

Physicists grouped by zodiac


Aqu
Ernst Mach
Wander Johannes de Haas
Albert Einstein
Val Logsdon Fitch
Steven Chu
George Gamow
Ernest William Titterton
René Antoine Ferchault de Réaumur
Jean Charles Athanase Peltier|Jean Peltier
Georgy Flyorov
Ludwig Boltzmann
Ernest Esclangon
Mahmoud Hessaby
Edward Condon
Leon Cooper
Lev Artsimovich
Alan Guth
Giovanni Battista Venturi
David Gross
George Smoot
Evgeny Lifshitz
Heinrich Kayser
Allan McLeod Cormack
Georg Ohm
Simon Newcomb
Lochlainn O'Raifeartaigh
Joseph von Fraunhofer
Karl Herzfeld
Gunnar Nordström
Geoffrey Ingram Taylor
Albert Fert
Rashid Sunyaev
Ralph Kronig
George William Hill
Heinz Pagels
Pieter van Musschenbroek
Robert Serber
Yakov Borisovich Zel'dovich
Nicolaas Bloembergen
Georges Charpak
Anthony Ichiro Sanda
Otto Hahn
Otto Scherzer
Daniel C. Tsui|Daniel Chee Tsui
Vilhelm Bjerknes
Yakov Lvovich Alpert
Marie Alfred Cornu
Terence James Elkins
Dmitry Shirkov
Walter Marshall, Baron Marshall of Goring|Walter Marshall
Abraham Alikhanov
Frederick Reines
Richard Dalitz
Leo Esaki
Thomas Townsend Brown
Brian Cox (physicist)|Brian Cox
Robert R. Wilson
Thanu Padmanabhan
Alexander Graham Bell
Benoît Paul Émile Clapeyron|Benoît Clapeyron

Tau
Dennis Gabor
John Cockcroft
Oliver Lodge
Chien-Shiung Wu
Charles Glover Barkla
Robert S. Mulliken
Edward Ramberg
Herbert L. Anderson
Lisa Randall
George Sterman
John Robert Schrieffer
Kenneth G. Wilson|Kenneth Geddes Wilson
Marian Smoluchowski
Joseph Lykken
Gaspard-Gustave Coriolis
Boris Chirikov
Jack Steinberger
Georges-Louis Le Sage|Georges-Louis le Sage
Geoffrey Chew
Gerald Feinberg
Peter Higgs
Andrei Sakharov|Andrei Dmitrievich Sakharov
Kazimierz Fajans
Charlotte Riefenstahl|Charlotte (née Riefenstahl) Houtermans
Charlotte Riefenstahl
William Gilbert (astronomer)|William Gilbert
Nathan Isgur
Charles-Augustin de Coulomb
Heinrich Rohrer
Maxim Chernodub
Pieter Zeeman
Andrei Sakharov
Andrea M. Ghez
John Bardeen
John Herapath
Carl Hermann
Luis Walter Alvarez
Bertha Swirles
Jean Henri van Swinden
Nicolas Léonard Sadi Carnot
Aage Bohr
Thomas Young (scientist)|Thomas Young
Rudolf Peierls
John Sealy Townsend|John Townsend
Hagen Kleinert
Philipp Lenard
Sam Treiman
Luigi Puccianti
Ludwig Waldmann

Scorp
Manne Siegbahn
Russell Alan Hulse
Fred Alan Wolf
Louis Néel
John Kerr (physicist)|John Kerr
Dmitry Zubarev
Gustaf Dalén
Henri Becquerel
Henry Moseley
Tsung-Dao Lee
Mitchell Feigenbaum
Werner Heisenberg|Werner Karl Heisenberg
Gurgen Askaryan
Alain Haché
Sylvester James Gates
Nikolay Basov
David Bohm
Philip Warren Anderson
Jean Dalibard
Tycho Brahe
James Rainwater
Isaac Beeckman
Robert Adler
Charles Galton Darwin
Robert Döpel
Franz Aepinus
J. J. Thomson
Sheldon Lee Glashow
Eric Allin Cornell
Edwin Ernest Salpeter
Jagadish Chandra Bose
Ernst Chladni
Friedrich Hasenöhrl
Steven Frautschi
Joseph Henry
Louis Slotin
Abraham Bennet
Simon van der Meer
Louis Rendu
John Henry Schwarz
Max Born
Henry Way Kendall
Marcel Brillouin
Arnold Sommerfeld
Peter Andreas Hansen
Johannes Diderik van der Waals
Christian Doppler
Joseph Louis Gay-Lussac
Nikolaus Riehl
Lars Onsager
David Brewster

Lib
William Allis
Johannes Bosscha
Samuel King Allison
Michael R. Douglas
Homi J. Bhabha
Jack Kilby|Jack St. Clair Kilby
Samuel Tolansky
Melvin Schwartz
Johannes Rydberg|Janne Rydberg
G. B. Pegram
Theodor Kaluza
Piara Singh Gill
Alfred Lee Loomis
Joseph Swan|Joseph Wilson Swan
Richard E. Taylor|Richard Edward Taylor
Mikhail Lavrentyev
G. N. Glasoe
Revaz Dogonadze
Dennis William Sciama
Lise Meitner
Mark B. Wise
Jean le Rond d'Alembert
Pierre-Gilles de Gennes
Gregory Jaczko
Per-Olov Löwdin
Ward Plummer
Wilhelm Eduard Weber|Wilhelm Weber
Mikhail Lomonosov
Edwin Hall
Igor Ternov
Ilya Frank
Marie Curie
Gleb Wataghin
Richard Keith Ellis|R. Keith Ellis
Francis G. Slack
John Lennard-Jones
Theodor W. Hänsch|Theodor Wolfgang Hänsch
William Daniel Phillips
Abram Ioffe
Robert B. Laughlin|Robert Betts Laughlin
Joseph Rotblat

Cap
Marshall Rosenbluth
Ilya Prigogine
Lev Landau
Helmut Hönl
Thomas Edison
Ludwig Prandtl
Walter Lewin
Abraham Haskel Taub
Brian Greene
William Markowitz
Walter Houser Brattain
Edwin C. Kemble
Joseph Louis Lagrange|Joseph-Louis Lagrange
Carolina Henriette Mac Gillavry
Yakov Frenkel
Herman Feshbach
Gregor Wentzel
Balthasar van der Pol
Fritz Houtermans
Hideki Yukawa
Otto Stern
Fritjof Capra
Rolf Landauer
Samuel C. C. Ting|Samuel Chao Chung Ting
Polykarp Kusch
Lew Kowarski
Ralph Asher Alpher
Robert Boyle
Milton S. Plesset
Charles Thomson Rees Wilson
Rudolf Mössbauer
Friedrich Hund
Toichiro Kinoshita
Robert Hofstadter
Michio Kaku
Julian Schwinger
Irving Langmuir
Toshihide Maskawa
Antony Garrett Lisi
Abdus Salam
Lincoln Wolfenstein
Emilio G. Segrè
Henk Dorgelo
Grigory Landsberg
Léon Van Hove
Carl Ramsauer
David Lee (physicist)|David Lee
Gerard K. O'Neill
Bob Lazar
Evgeny Velikhov
Paul Langevin
Pierre Louis Dulong
William Shockley|William Bradford Shockley
Paul Peter Ewald
Robbert Dijkgraaf
Ernst Stueckelberg
Martin Knudsen
Manfred von Ardenne
Frank J. Tipler
Leó Szilárd
Nikolay Umov
Daniel Bernoulli

Sag
Tom Baehr-Jones
Valentine Telegdi
Ernst Ruska|Ernst August Friedrich Ruska
Emil Wiechert
Predhiman Krishan Kaw|Predhiman K. Kaw
Rudolf Clausius
James Prescott Joule
Wilhelm Wien
Johannes Kepler
Yoichiro Nambu
Igor Kurchatov
Ali Javan
Ralph H. Fowler
Michael Woolfson
Erwin Fues
Walter Heitler
Walther Kossel
Ronald Ernest Aitchison
John C. Slater
Robert Woodrow Wilson
Stephen Hawking
William Prout
Walther Bothe
Vladimir Fock
Satyendra Nath Bose
Albert Abraham Michelson
Jean-Pierre Vigier
Brian David Josephson
Edward Teller
Paul Ehrenfest
Richard A. Muller

Leo
Adolfas Jucys
William A. Bardeen
Konstantin Novoselov
John Dalton
Saul Perlmutter
Oskar Klein
John Henry Poynting
Louis Essen
Arpad Elo
Edward Witten
Heike Kamerlingh Onnes
Max Delbrück
Jim Al-Khalili
Jeffrey Goldstone
George Sudarshan
Nathan Seiberg
Murray Gell-Mann
Edward Victor Appleton
James Franck
Roy J. Glauber|Roy Jay Glauber
Hans Georg Dehmelt
Stephen Wolfram
James Dewar
Léon Foucault
Arthur Compton
Donald A. Glaser|Donald Arthur Glaser
Hermann von Helmholtz
Swapan Chattopadhyay
Herbert Kroemer
Peter Goddard (physicist)|Peter Goddard
Samuel T. Durrance
Masatoshi Koshiba
Robert K. Logan
William Eccles
Viktor Hambardzumyan
Gabriele Veneziano
Matthew Koss
Holger Bech Nielsen
Chen Ning Yang
James Hopwood Jeans|Sir James Jeans
Edward Mills Purcell
Juan Martín Maldacena
Carl David Anderson
Sergei Tyablikov
Robert von Lieben
Peter Freund
Woldemar Voigt
Victor Frederick Weisskopf
Luigi Galvani
Heinrich Welker
Irène Joliot-Curie
Alexander Animalu
Amédée Mouchez
Norman Foster Ramsey, Jr.
Osborne Reynolds
Sidney Drell
Michael Faraday

Vir
Neil deGrasse Tyson
Semen Altshuler
John Polkinghorne
Hippolyte Fizeau
José Enrique Moyal
Evangelista Torricelli
Clifford Shull
Wolfgang Ketterle
Daniel Friedan
Eugene Feenberg
Johann Georg Tralles
Henry Cavendish
James Chadwick
Ernest Walton
Vitaly Ginzburg
Clinton Davisson
Marvin Leonard Goldberger
Andre Geim
James Cronin
Gian-Carlo Wick
Isaak Markovich Khalatnikov
Vitaly Ginzburg|Vitaly Lazarevich Ginzburg
Karl Schwarzschild
Adolf Kratzer
Leo Graetz
Jean Baptiste Perrin
Jens Martin Knudsen
Robert Marshak
Peter Mansfield
Mihajlo Idvorski Pupin
Thomas Corwin Mendenhall
Alexei Yuryevich Smirnov
Max von Laue
Riccardo Giacconi
Bent Sørensen (physicist)|Bent Sørensen
Subrahmanyan Chandrasekhar
Martin Ryle
Antonino Zichichi
Georges Sagnac
Niels Bohr
Barton Zwiebach
Willem 's Gravesande
Pascual Jordan
Martin Gutzwiller
Joseph Plateau
Werner Israel

Can
Rudolf Haag
Wolfgang Paul
Paul Dirac
Zoltán Lajos Bay
Adriaan Fokker
Walter Gerlach
Hans Christian Ørsted
Léon Brillouin|Léon Nicolas Brillouin
John L. Hall|John Lewis Hall
John Tyndall
Gustave-Adolphe Hirn
Heinz Barwich
Nikolay Bogolyubov
Anatoly Vlasov
Amedeo Avogadro
Shirley Ann Jackson
Rosalind Franklin
William Alfred Fowler
Roger Penrose
Amrom Harry Katz
Emil Wolf
Hermann Brück
Alladi Ramakrishnan
Isidor Isaac Rabi
David Ruelle
John C. Mather|John Cromwell Mather
Egon Orowan
Loránd Eötvös
Anders Jonas Ångström
John Clive Ward
Bruno Pontecorvo
John Hopkinson
Serge Rudaz
Walter H. Schottky
Aleksandr Stoletov
Vikram Sarabhai
Ernest Lawrence
Étienne-Louis Malus
Erwin Schrödinger
Friedrich Ernst Dorn
William Rowan Hamilton
Fulvio Melia

Ari
Oliver Heaviside
David Enskog
Augustin-Jean Fresnel
Willem de Sitter
Frank Wilczek
Carlo Marangoni
Peter Grünberg
Paul Davies
Johann Jakob Balmer
Jean-Baptiste Biot
Carl Eckart
Robert W. Wood
Pierre Curie
Max Volmer
Oleg Losev
Leonid Mandelstam
George Paget Thomson
Percy Williams Bridgman
Max Planck
Arno Allan Penzias
Carlo Rovelli
Abdul Qadeer Khan
Felix Ehrenhaft
Max Tegmark
Helen Quinn
Johannes Georg Bednorz
Steven Weinberg
Antony Hewish
Wolfgang K. H. Panofsky
Manfred Eigen
Derek Abbott
Karl Alexander Müller
Kai Siegbahn
Juris Upatnieks
Abraham Pais
Yuval Ne'eman
Richard Feynman
Richard Makinson
Arthur Korn
Owen Willans Richardson
Henri Poincaré
Aaldert Wapstra
Joseph Polchinski
Alfred Kastler
Harrie Massey
Gersh Budker
Arthur Leonard Schawlow
Ernst Ising
James David Forbes

Pis
Denis Evans
Franz S. Exner
Peter Debye|Peter Debije
Feza Gürsey
Douglas Hartree
John Leslie (physicist)|John Leslie
Burton Richter
Bernhard Philberth
Thomas Johann Seebeck
Arthur Wightman
Claude Cohen-Tannoudji
Carlo Rubbia
Christiaan Huygens
Horst Ludwig Störmer
Makoto Kobayashi (physicist)|Makoto Kobayashi
William Lawrence Bragg
Mikhail Shifman
James Alfred Ewing
James Glimm
Ivar Giaever
Stanislaw Ulam
Heinz Pose
Joseph Stefan|Jožef Stefan
Ludvig Faddeev
Wilhelm Röntgen|Wilhelm Conrad Röntgen
Macedonio Melloni
Marcela Carena
Peter Adolf Thiessen
Ernst Brüche
Carl Wieman
Joseph Fourier
Sin-Itiro Tomonaga
Joseph Hooton Taylor, Jr.
Maurice Goldhaber
Johannes Stark
Nima Arkani-Hamed
Max Steenbeck
Vladimir Gribov
Maurice Loewy
Anthony James Leggett
Emmy Noether
Ashok Das
Jan Zaanen
Jerome Isaac Friedman
Nicola Cabibbo

Gem
Gary Gibbons
Siméon Denis Poisson
Ben Roy Mottelson
Gerd Binnig
Johann Baptiste Horvath
Tullio Regge
Hans Geiger
Bertram Brockhouse
Brian May
Willis Lamb
Paul Drude
Pyotr Kapitsa
Samuel Goudsmit
Fred Hoyle|Sir Fred Hoyle
Nikola Tesla
Burchard de Volder
William Henry Bragg
Jocelyn Bell Burnell
Pavel Cherenkov
John Ellis (physicist)|John Ellis
Robert Coleman Richardson
Karl Zimmer
J. Hans D. Jensen|Johannes Hans Daniel Jensen
Hans Bethe
Frederick Seitz
Paul Harteck
William Thomson, 1st Baron Kelvin|William Thomson (Lord Kelvin)
Hendrik Lorentz
Félix Savart
Pervez Hoodbhoy
Hubert Reeves
Hendrik Casimir
Victor Francis Hess
Luciano Maiani
Martin Rees, Baron Rees of Ludlow|Martin John Rees
Robert Hooke
Maria Goeppert-Mayer
Martin Lewis Perl
Alexander Prokhorov
Frits Zernike
Alexei Alexeyevich Abrikosov
Gottfried Wilhelm Leibniz
Jayant Narlikar
Theodore Maiman
Georges Lemaître
Owen Chamberlain
David Finkelstein
Leon M. Lederman|Leon Max Lederman
Edwin Thompson Jaynes|Edwin Jaynes
John Stewart Bell
Karl L. Littrow
Carl Friedrich von Weizsäcker
Boris Podolsky
Klaus von Klitzing
Martinus J. G. Veltman
Willem Hendrik Keesom
John Archibald Wheeler

Tuesday, January 31, 2012

Monte Carlo Simulation in Python

Problem:
Suppose you flip a coin a certain number of times, say n. You want to know your chance of getting a certain number of heads in a row, say k.

Let's start with the function that'll do a certain number of coin flips:
Let's do 10 flips right away:
do_n_flips(10)
'0100111001'

Let's suppose you get 1 euro if you get the given number of heads in a row while doing a certain number of flips. Here's the function that'll calculate your payoff:
Let's try our payoff function with 10 flips, getting paid for 3 heads in a row. Sometimes it is 0, sometimes it is 1:
payoff(10, 3)
0.0
payoff(10, 3)
1.0

Now, what is your chance of getting your 1 euro for 3 heads in a row from 10 flips? This brings us to the Monte Carlo simulation, which I won't describe here at all; I'll just give you a function which answers our question regarding the chance:
Running it with a million iterations gives us a good approximation of the chance:
monte_carlo_solve(10, 3, 1000000)
0.507753
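The three functions named above could be sketched as follows. This is a reconstruction under assumptions, not the original code: it assumes '1' stands for heads, and the optional rng parameter and the exact_chance check are additions for this example.

```python
import random


def do_n_flips(n, rng=random):
    """Return n fair coin flips as a string; '1' is heads, '0' is tails."""
    return "".join(rng.choice("01") for _ in range(n))


def payoff(n, k, rng=random):
    """1.0 if n flips contain at least k heads in a row, else 0.0."""
    return 1.0 if "1" * k in do_n_flips(n, rng) else 0.0


def monte_carlo_solve(n, k, iterations, rng=random):
    """Estimate the chance of a payoff as the average over many trials."""
    return sum(payoff(n, k, rng) for _ in range(iterations)) / iterations


def exact_chance(n, k):
    """Exact chance, by counting the flip strings with no run of k heads."""
    # ways[j] = number of such strings so far that end in exactly j heads
    ways = [1] + [0] * (k - 1)
    for _ in range(n):
        total = sum(ways)
        # A tail resets the run to 0; a head extends every run by one.
        ways = [total] + ways[:-1]
    return 1 - sum(ways) / 2 ** n


estimate = monte_carlo_solve(10, 3, 20000, random.Random(0))
exact = exact_chance(10, 3)  # 520 of the 1024 strings qualify
```

The exact value for 10 flips and 3 heads in a row is 520/1024 = 0.5078125, which agrees with the simulated 0.507753 to three decimal places.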

This post was inspired by a similar post by Remis.

Wednesday, January 4, 2012

How to use wildcard in Google search

This example demonstrates how to use a wildcard in Google search.

Let's suppose we have this idiom: "if you have a somebody for a friend, you don't need an enemy", but we do not remember exactly what that somebody would be.

Let's ask Google: "If you have a * for a friend, you don't need an enemy".

And here we go. For the record, at the time of writing, Google thinks this somebody is either a Hungarian or a politician. I assume that in fact it is a Hungarian politician :)

Friday, December 16, 2011

Twibet: a recursive solution in Python

Today a friend told me she's struggling with solving the EuroPython 2011 Problem D, called Twibet. I wrote a quick solution and I'm publishing it here for those who are curious about it.
The key to my solution was the recursive traversal (traverse_graph) of a directed graph. The representation of the graph is a dictionary of the monks as keys and a list of the following monks as values.
Because running on the large input file raised a "maximum recursion depth exceeded" runtime error, I used this little hack: sys.setrecursionlimit(10000).
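To illustrate the traversal only (not the rules of the Twibet problem itself), here is a minimal sketch of a recursive walk over a graph stored as a dictionary of lists. The signature of traverse_graph is a guess, and the iterative variant is an addition showing one way to avoid the recursion limit altogether.

```python
def traverse_graph(graph, start, visited=None):
    """Recursively collect every node reachable from start."""
    if visited is None:
        visited = set()
    visited.add(start)
    for neighbour in graph.get(start, []):
        if neighbour not in visited:
            traverse_graph(graph, neighbour, visited)
    return visited


def traverse_graph_iterative(graph, start):
    """Same result with an explicit stack, so no recursion limit applies."""
    visited = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(graph.get(node, []))
    return visited


# Monks as keys, lists of monks as values; 4 is not reachable from 1.
graph = {1: [2], 2: [3], 3: [1], 4: [1]}
reachable = traverse_graph(graph, 1)

# A chain far deeper than the default recursion limit.
chain = {i: [i + 1] for i in range(50000)}
chain_nodes = traverse_graph_iterative(chain, 0)
```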
UPDATE: Here are the solutions of the contestants.

Thursday, August 25, 2011

Concurrent matrix multiplication in Python

We're going to demonstrate parallel matrix multiplication in Python.

Let's suppose we are performing the multiplication P = A * B. We're going to calculate the result matrix row by row. The following function returns a row of the result matrix P. Its arguments are a row of matrix A and the matrix B.

def calc_row_of_product_matrix(a_row, b):
    return map(lambda col: sum(starmap(mul,izip(a_row,col))), izip(*b))


We're going to use the multiprocessing module of Python to create a process pool containing as many processes as the number of CPUs we have, and calculate each result row in a different process. Here are the two lines that do the trick:

def __mul__(self, b):
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    return pool.map(eval_func_tuple, izip(repeat(calc_row_of_product_matrix), self, repeat(b)))


Some explanation of why this code is a bit nasty. Functions passed to the pool have to be picklable, and because only functions defined at the top level of a module are picklable, we needed to move the row calculation function calc_row_of_product_matrix out of the class. Furthermore, to be able to pass two arguments to calc_row_of_product_matrix, we also needed a helper function, which takes a tuple of a function and its arguments, evaluates it and returns the result:

def eval_func_tuple(f_args):
    return f_args[0](*f_args[1:])
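The snippets above are Python 2 (izip, a list-returning map). On Python 3.3 and later, Pool.starmap unpacks argument tuples itself, so the helper is no longer needed. A minimal sketch under that assumption follows; multiply_serial and multiply_parallel are hypothetical names, and plain lists of lists stand in for the matrix class.

```python
import multiprocessing
from itertools import repeat
from operator import mul


def calc_row_of_product_matrix(a_row, b):
    """Return one row of A * B, given a row of A and the whole matrix B."""
    # zip(*b) iterates over the columns of B
    return [sum(map(mul, a_row, col)) for col in zip(*b)]


def multiply_serial(a, b):
    """Single-process reference implementation."""
    return [calc_row_of_product_matrix(row, b) for row in a]


def multiply_parallel(a, b, processes=None):
    """Compute each result row in a worker process."""
    # processes=None lets the pool use one worker per CPU
    with multiprocessing.Pool(processes) as pool:
        # starmap unpacks each (row, b) pair into two arguments
        return pool.starmap(calc_row_of_product_matrix, zip(a, repeat(b)))


product = multiply_serial([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

On platforms that start workers by spawning a fresh interpreter, multiply_parallel should be called from under an if __name__ == "__main__": guard.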


Note that passing the whole matrix B to every process calculating result rows, or more precisely passing itertools.repeat(b) to pool.map(), visibly increases the memory consumption of the multiprocess version. This was not a real issue for me, as the bottleneck was the CPU; anyway, it could be addressed by using the shared memory facilities of multiprocessing. For now we'll leave that as an exercise for the reader.

Here are the running times on my Intel Core2 Duo 3.16GHz box. For sufficiently large matrices (500*500 and above), the running time of the multiprocess version is nearly half that of the single process version.


100*100 matrix, single process: 0.0835670982343 
100*100 matrix, multiprocess: 0.351096555199 
200*200 matrix, single process: 0.79961114284 
200*200 matrix, multiprocess: 0.700980680681 
500*500 matrix, single process: 14.4003259234 
500*500 matrix, multiprocess: 7.99582457187 
1000*1000 matrix, single process: 118.078526896 
1000*1000 matrix, multiprocess: 66.8809939919 

Here you can see the single process version running with CPU usage of 50%:

Here you can see  the multiprocessing version running with CPU usage of 100%:



Here's the full code:

Wednesday, June 22, 2011

Ngrams with coroutines in Python

This is how I define ngrams with coroutines

I need to filter the text before generating ngrams, and I also want to process the ngrams (in this case, count bigrams)

I combine my coroutines together
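A minimal sketch of such a pipeline with generator-based coroutines is shown below. It is not taken from the gibberish detector fork: the stage names, the filtering rule (letters and spaces only, lower-cased) and the use of character ngrams are assumptions made for this example.

```python
from collections import Counter
from functools import wraps


def coroutine(func):
    """Advance a generator to its first yield so it can accept send()."""
    @wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


@coroutine
def letters_only(target):
    """Filter stage: pass lower-cased letters and spaces downstream."""
    while True:
        char = yield
        if char.isalpha() or char == " ":
            target.send(char.lower())


@coroutine
def ngrams(n, target):
    """Keep a sliding window of n characters and emit each full window."""
    window = ""
    while True:
        char = yield
        window = (window + char)[-n:]
        if len(window) == n:
            target.send(window)


@coroutine
def count_into(counter):
    """Sink stage: tally every ngram it receives."""
    while True:
        ngram = yield
        counter[ngram] += 1


# Combine the stages: filter -> bigrams -> counter
counts = Counter()
pipeline = letters_only(ngrams(2, count_into(counts)))
for char in "Banana!":
    pipeline.send(char)
```

Each stage only knows the next one through send(), so a filter or a different sink can be swapped in without touching the ngram stage.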

Full source can be found in my fork of rrenaud's gibberish detector.