Julien's tech blog

talking about tech stuff: stuff I'm interested in, intriguing stuff, random stuff.

Monthly Archives: August 2011

Exception handling, Checked vs Unchecked exceptions, …

Some thoughts in random order about exceptions in Java, as they often get overlooked. The following points are often related. Mostly I’m talking to myself here, don’t take it too personally when I say “you should do this” 🙂 just imagine I’m the guy from Memento and that I tattoo myself with my blog posts. Feel free to disagree/add your own views in the comments (though I may not get your opinion tattooed on my body). Here I’m assuming we are building a library that others will use.

1. Exceptions are part of the contract

When you declare a method, it defines a contract for the level of abstraction of the class. The exceptions are part of this contract; exceptions thrown by a method should be relevant to the level of abstraction. Just throwing the exceptions from the underlying layer is letting the implementation details slip through the interface. (See the point about chaining exceptions)

Example :
You write a login method implemented by querying an underlying DB. It should throw a LoginFailedException (identifier or password incorrect), it should not throw a SQLException. LoginFailedException extends a base class that you define (extending java.lang.Exception) and is common to all your checked exceptions. (See following point).

2. Have base classes for the checked and unchecked exceptions

Sometimes it is useful to be able to catch exceptions of one specific library at a higher level. For example, catching all the runtime exceptions that have not been handled by the library. That way they can be caught in a single catch block. In Java 7 you will be able to catch multiple exceptions in a single block. Inheritance for exception is just a category tree even if usually that’s not how you want to think of inheritance.

3. Do not throw java.lang.Exception and make sure that declared exceptions are actually thrown

Throwing java.lang.Exception hides the different cases that you may have to handle depending of what type of exception it is. Checked exceptions is a handy way of making sure you don’t forget to handle error cases, as long as you declare them correctly. People using your library will swear (or maybe it’s just me. Either way, if I’m going to use your code, please do it for my co-workers). In general design your API in a way that does not force people to declare catch blocks in cases when exceptions are never thrown. For examplenew String(bytes, “UTF-8”) throws UnsupportedEncodingException even though UTF-8 is always supported. In java 6 new String(byte[], Charset) was added to avoid this.

4. Checked vs Unchecked exceptions

Checked exceptions are for cases where the caller should do something about the error. This is for exceptional cases that should be handled. For example, if the login failed, you should problably display an error and ask to retry. The users will dometimes mistype their password and you should always handle it.
Unchecked exception are for run time errors caused by bugs or unexpected failures when the caller could not possibly do something about it and the default expected behavior is just to fail. Most of the time you want to centrally handle those at the top of the stack to display an error message or send an alert to the monitoring system. The caller can still catch it if it wants to, but it does not have to.
For example, if the data base refuses the connection you may throw a DatabaseUnavailableException that extends a base class (extending java.lang.RuntimeException) common to all your uncaught exceptions. You could have an intermediary layer that will retry the transaction or a top level apologetic error message asking to come back later. The main point is that the exception is not dealt with where it is thrown.

5. Chaining Exceptions

Since Java 1.4 all exceptions can be chained. When you catch an Exception and throw a new one related to your level of abstraction, you should chain the original one to make sure you have all available information to fix a problem. A good stack trace tells you exactly where the bug is (See the “fail early” point). It not fun when the production issue you have to fix urgently reports itself without providing the root cause. You usually end up patching the exception chaining first then reproduce the error then know what happened.

Sometimes people ask how to display the “… 2 more” at the end of a chained stack trace. The display does not truncate any information and those lines are already there in the parent stack trace. Obviously exceptions that have a cause will have the end of the stack trace in common with their cause, printStackTrace() is not printing those duplicate lines.

6. Add more information.

When you catch an Exception and throw a new one related to your level of abstraction, you should add information related to this upper level.
example:

void readConfiguration() throws InvalidConfigurationException {
  File confFile = new File("conf/conf.properties");
  try {
   ...
  } catch (IOException e) {
    throw new InvalidConfigurationException("Error while reading the configuration at "+confFile.getAbsolutePath(),e);
  }
 }

In general put as much information as you can (id of the object for which it failed …), but keep it to one line.

7. do not have empty catch blocks

If the exception can not possibly get thrown (new StringReader(stream,”UTF-8″) throws UnsupportedEncodingException) just throw an exception saying so. That way if you’re wrong you don’t hide the problem.
If it is really what you want to do (the API you call probably needs refactoring) at least put a comment explaining why.

8. Fail early

Prefer throwing an exception to use a default value when it fail. You prefer your code to tell you what’s wrong instead of doing something you didn’t ask for.
If something is not what’s expected throw an exception, better fail early on the real cause than later on the consequences which will be harder to debug.

9. Read the error message

When there’s an exception if you did a good job, you should be able to know quickly what the problem is by reading the messages and the stack traces of the chain of exceptions. I know this one sounds silly, but how many times have you had a bug report mentioning that “it fails” without a stacktrace?

(this is an open list, I may come back later to add some)

Advertisements

Transitive Closure in Pig

1. Introduction

This a follow-up on my previous post about implementing PageRank in Pig using embedding. I also talked about this in a presentation to the Pig user group.
One of the best features of embedding is how it simplifies writing UDFs and using them right away in the same script without superfluous declarations. Computing a transitive closure is a good example of an algorithm requiring an iteration, a few simple UDFs and an end condition to decide when to stop iterating. Embedding is available in Pig 0.9. Knowledge of both Pig and Python is required to follow. Examples are available on github.

2. The problem

Before diving into the solution I will give a brief overview of the problem. We want to find the connected components of a graph by performing a transitive closure. Of course I will assume that the graph does not fit in main memory otherwise what follows is not very useful.
A picture is worth a thousand words; here is a visual statement of the problem:

Transitive Closure

Transitive Closure


I highlighted the diameter in red as it plays an important role in the complexity of the algorithm. The diameter of a connected component is the longest shortest path between 2 nodes of the component.
The algorithm will require log2(max(diameter of a component)) iterations because every iteration we double the number of relations that have been followed. This is not the overall complexity but it ensures that the number of iterations will stay low as with just 10 iterations you can handle components with a diameter of 1000+. In most real life use cases the diameter is very low and a few iterations will suffice.

3. The algorithm

I wrote the following implementation to illustrate the embedding feature. It may not be optimal but it is a working solution with a reasonable number of iterations. The algorithm does a graph traversal from all the nodes in parallel. It does one hop per iteration (pardon the scientific jargon) and then merges the results per node (using follow() and OR()). As a consequence the diameter of the graph followed from a node doubles every iteration. In this case, the edges of the graph are considered bidirectional and are stored one way (see normalize() and bidirectional()) to save space. The edges are stored in two files: followed (empty at the beginning) and to_follow which will tell us when all the edges have been followed (to_follow is empty). The last part of the script does some formatting to turn a list of edges into group of nodes (using sort() to make sure they are always represented the same way).

4. The code

Here is an example of how to solve this problem by embedding Pig in Python.
(explanation follows)
Input data (data/tc_data_simple.txt):

id0	id1
id1	id2
id2	id3
id3	id4
id4	id5
id5	id6
id6	id7
id7	id8
id11	id10
id12	id11
id13	id12
id14	id13
id14	id15
id15	id16
id16	id17
id17	id18

simpletc.py:

#!/usr/bin/python

from org.apache.pig.scripting import *

@outputSchema("rels:{t:(id1: chararray, id2: chararray)}")
def bidirectional(id1, id2):
    if id2 != id1:
        return [ (id1, id2), (id2, id1) ]
    else:
        return [ (id1, id2) ]

def normalize(t):
    id1, id2, followed = t;
    if id2>id1:
        return (id1, id2, followed)
    else:
        return (id2, id1, followed)

@outputSchema("rels:{t:(id1: chararray, id2: chararray, followed: int)}")
def follow(to_follow_id1, links_id1, links_id2):
    outputBag = [ normalize( (links_id1, links_id2, True) ) ]
    if to_follow_id1 is not None:
        outputBag.append( normalize( (to_follow_id1, links_id2, False) ) )
    return outputBag

@outputSchema("followed: int")
def OR(bag):
    result = False;
    for followed in bag:
        result = result | followed[0]
    return result  

@outputSchema("group: {t: (id: chararray)}")
def SORT(bag):
    bag.sort()
    return bag

def main():
    # cleanup output directory before starting
    Pig.fs("rmr out/tc")
   
    Q = Pig.compile("""
    followed = LOAD ‘$followed’ AS (id1: chararray, id2: chararray);
    followed = FOREACH followed GENERATE FLATTEN(bidirectional(id1, id2)), 1 AS followed; — 1 == true
   
    to_follow = LOAD ‘$to_follow’ AS (id1: chararray, id2: chararray);
    to_follow = FOREACH to_follow GENERATE FLATTEN(bidirectional(id1, id2)), 0 AS followed; — 0 == false
   
    links = UNION to_follow, followed;
    joined_links = JOIN links BY id1 LEFT, to_follow BY id2;
   
    new_links_dup =
        FOREACH joined_links
        GENERATE FLATTEN( follow(to_follow::rels::id1, links::rels::id1, links::rels::id2) );

    new_links =
        FOREACH (GROUP new_links_dup BY (id1, id2))
        GENERATE group.id1, group.id2, OR(new_links_dup.followed);

    SPLIT new_links INTO new_followed IF followed != 0, new_to_follow IF followed == 0;
    new_followed = FOREACH new_followed GENERATE id1, id2;
    new_to_follow = FOREACH new_to_follow GENERATE id1, id2;
    STORE new_followed INTO ‘$new_followed’;
    STORE new_to_follow INTO ‘$new_to_follow’;
    """
    )
   
    to_follow = "data/tc_data_simple"
    followed = "out/tc/followed_0"
   
    # create empty dataset for first iteration
    Pig.fs("mkdir " + followed)
    Pig.fs("touchz " + followed + "/part-m-00000")
   
    for i in range(10):
        new_to_follow = "out/tc/to_follow_" + str(i + 1)
        new_followed = "out/tc/followed_" + str(i + 1)
        job = Q.bind().runSingle()
        if not job.isSuccessful():
            raise ‘failed’
        to_follow = new_to_follow
        followed = new_followed
        # detect if we are done
        if not job.result("new_to_follow").iterator().hasNext():
            break

    Pig.compile("""
    links = LOAD ‘$followed’ AS (id1: chararray, id2: chararray);
    links = FOREACH links GENERATE FLATTEN( bidirectional(id1, id2) );
    result = DISTINCT (
        FOREACH (GROUP links by id1)
        GENERATE SORT(links.id2));
    STORE result INTO ‘out/tc/groups’;
    """
).bind().runSingle();

if __name__ == ‘__main__’:
    main()

Output data:

{(id0),(id1),(id2),(id3),(id4),(id5),(id6),(id7),(id8)}
{(id10),(id11),(id12),(id13),(id14),(id15),(id16),(id17),(id18)}

5. Notables

There are a few things to notice in this implementation:
– UDFs are just defined by functions in the same script
– The output schema of the UDF is defined by using the @outputSchema decorator. Pig needs the schema of the output to interpret the statements following the UDF call.
– The native Python structures (dictionary, tuple, list) are used, the conversion to Pig types (Map, tuple, bag) is automatic. This makes the UDF code much more compact.
– UDFs are directly available in embedded Pig scripts using the function names. No other declaration is required.
– We iterate a maximum of 10 times (max diameter = 210 = 1024) and check if the “to follow” relation is empty to decide if we need to stop the iteration. This is done using the JobStats object returned by runSingle().

There’s also an experimental Javascript embedding that I’ll talk about in a future post.