Error: Twitter did not respond. Please wait a few minutes and refresh this page.
talking about tech stuff: stuff I'm interested in, intriguing stuff, random stuff.
This a follow-up on my previous post about implementing PageRank in Pig using embedding. I also talked about this in a presentation to the Pig user group.
One of the best features of embedding is how it simplifies writing UDFs and using them right away in the same script without superfluous declarations. Computing a transitive closure is a good example of an algorithm requiring an iteration, a few simple UDFs and an end condition to decide when to stop iterating. Embedding is available in Pig 0.9. Knowledge of both Pig and Python is required to follow. Examples are available on github.
Before diving into the solution I will give a brief overview of the problem. We want to find the connected components of a graph by performing a transitive closure. Of course I will assume that the graph does not fit in main memory otherwise what follows is not very useful.
A picture is worth a thousand words; here is a visual statement of the problem:
I wrote the following implementation to illustrate the embedding feature. It may not be optimal but it is a working solution with a reasonable number of iterations. The algorithm does a graph traversal from all the nodes in parallel. It does one hop per iteration (pardon the scientific jargon) and then merges the results per node (using follow() and OR()). As a consequence the diameter of the graph followed from a node doubles every iteration. In this case, the edges of the graph are considered bidirectional and are stored one way (see normalize() and bidirectional()) to save space. The edges are stored in two files: followed (empty at the beginning) and to_follow which will tell us when all the edges have been followed (to_follow is empty). The last part of the script does some formatting to turn a list of edges into group of nodes (using sort() to make sure they are always represented the same way).
Here is an example of how to solve this problem by embedding Pig in Python.
Input data (data/tc_data_simple.txt):
id0 id1 id1 id2 id2 id3 id3 id4 id4 id5 id5 id6 id6 id7 id7 id8 id11 id10 id12 id11 id13 id12 id14 id13 id14 id15 id15 id16 id16 id17 id17 id18
There are a few things to notice in this implementation:
– UDFs are just defined by functions in the same script
– The output schema of the UDF is defined by using the @outputSchema decorator. Pig needs the schema of the output to interpret the statements following the UDF call.
– The native Python structures (dictionary, tuple, list) are used, the conversion to Pig types (Map, tuple, bag) is automatic. This makes the UDF code much more compact.
– UDFs are directly available in embedded Pig scripts using the function names. No other declaration is required.
– We iterate a maximum of 10 times (max diameter = 210 = 1024) and check if the “to follow” relation is empty to decide if we need to stop the iteration. This is done using the JobStats object returned by runSingle().