In this article, we will show you step by step how you can build a reusable POJO-based Data Grid with nothing but standard JDK 1.5 and then make it distributed through the use of the declarative JVM-level clustering technology provided by Open Terracotta.
The article is divided into three sections:
- In the first section we will talk about the concepts of Data Grids, what they are and what they do. We will discuss some of the underlying mechanisms that are used in Data Grids and finally how one can scale out applications on a Data Grid.
- The second section will walk you through how to build a naive but fully usable implementation of a POJO-based Data Grid by implementing a multi-threaded Master/Worker container and cluster it using Terracotta's JVM-level clustering technologies.
- Last, in the third section we will show you how to extend the initial implementation to handle real-world requirements such as dealing with; high volumes of data, work failure, different routing algorithms, ordering of work and worker failure.
Part 1: Data Grids - What's behind the buzzwords?
What are Data Grids?
A Data Grid is a set of servers that together creates a mainframe-class processing service where data and operations can move seamlessly across the grid in order to optimize the performance and scalability of the computing tasks submitted to the grid. A Data Grid scales through use of Locality of Reference (see the section How do Data Grids scale?) and is highly-available through effective use of data duplication. It combines data management with data processing.
What are Data Grids used for?
Applications that could benefit from being run on a Data Grid are usually applications that needs to work on large data sets and/or have a need to parallelize processing of the data in order to get better throughput and performance. Examples are financial risk analysis and other simulations, searching and aggregation on large datasets as well as sales order pipeline processing.
How do Data Grids scale?
One of the main reasons why Data Grids can scale so well is that they can make intelligent choices if it should move data to its processing context or moved the processing context (the operations) to the data. Effective use of the latter means that it can make use of Locality of Reference; this means that data that is being processed on a specific node stays local to that node and will in the ideal case never have to leave that node. Instead, all operations that are working on this specific data set will always be routed to this specific node.
The immediate benefits are that it has the advantage of minimizing the latency and can in the ideal case give unlimited and linear scalability. How well it scales is of course use-case specific, and it can take both effort and skills in partitioning the data and work sets, as well as potential routing algorithms. The ultimate (and simplest) situation is when all work are what we call Embarrassingly Parallel - which means that it has no shared state and can to its working complete isolation. But it is in most cases acceptable to partition the work into work sets, in such a way that the work sets have no shared state. However, the latter solution might need some intelligent routing algorithms, something that we will take a look at later.
How do Data Grids deal with failure?
Data Grids are resilient to failure and down time by effective use of data duplication. Data Grids can be seen as an organic system of cells (nodes) that is designed to not only handle, but expect failure of individual cells (nodes). This is very different from traditional design of distributed systems in which each node is seen as an isolated unit which must always expect the worst and protect itself accordingly. See this page for a more thorough discussion on the subject.
Part 2: Build a multi-threaded Master/Worker container and cluster it with Open Terracotta
What is the Master/Worker pattern?
The Master/Worker pattern is a pattern that is heavily used in Data Grids and is one of the most well-known and common patterns for parallelizing work. We will now explain the characteristics of the pattern and then go through the possible approaches for a Java implementation.
So, how does it work?
The Master/Worker pattern consists of three logical entities: a Master, a Shared Space and one or more instances of a Worker. The Master initiates the computation by creating a set of tasks, puts them in some shared space and then waits for the tasks to be picked up and completed by the Workers. The shared space is usually some sort of Shared Queue, but it can also be implemented as a Tuple Space (for example in Linda programming environments, such as JavaSpaces, where the pattern is used extensively). One of the advantages of using this pattern is that the algorithm automatically balances the load. This is possible due to the simple fact that, the work set is shared, and the workers continue to pull work from the set until there is no more work to be done. The algorithm usually has good scalability as long as the number of tasks, by far exceeds the number of workers and if the tasks take a fairly similar amount of time to complete.
Possible approaches in Java
Let's now look at the different alternatives we have for implementing this in Java. There might be more ways of doing it, but we will focus the discussion on three different alternatives, each one with a higher abstraction level.
Using Java's threading primitives
The most hard-core approach is to use the concurrency and threading primitives that we have in the Java Language Specification (JLS), e.g. wait/notify and the synchronized and volatile keywords. The benefits are that everything is really "under your fingers", meaning that you could customize the solution without limitations. However, this is also its main problem, since it is both a very hard and tedious to implement this yourself, and will most likely be even worse to maintain. These low-level abstractions is not something that you want to work with on a day-to-day basis. We need to raise the abstractions level above the core primitives in the Java Memory Model and that is exactly what the data structure abstractions in the java.util.concurrent library in JDK 1.5 does for us.
Using the java.util.concurrent abstractions
Given that using the low-level abstractions in JLS is both tedious and hard to use, the concurrency abstractions in JDK 1.5 was both a very welcome and natural addition to the Java libraries. It is a very rich API that provides everything from semaphores and barriers to implementations of the Java Collections data structure interfaces highly tuned for concurrent access. It also provides an ExecutorService, which is mainly a thread pool that provides direct support for the Master/Worker pattern. This is very powerful, since you're basically getting support from the Master/Worker pattern in a single abstraction.
It is possible to cluster the ExecutorService using Terracotta, you can read about that exercise in this article. Even though this approach would be sufficient in many situations and use-cases it has some problems:
- First, it does not separate the master from the worker (since they are both part of the same abstraction). What this means in practice is that you cannot scale out that the master independently of the worker but each node will contain both a master and a worker.
- Second, and perhaps even more important, is that it is lacking a layer of reliability and control. Meaning that it is no way of knowing if a specific work task has been started, completed or if it has been rejected due to some error. This means that there is no way we can detect work failure and can retry the work task on the same node or on another node. So we need a way to deal with these things in a simple and if possible standardized way.
Using the CommonJ WorkManager specification
Spawning and coordinating threads is something that have been prohibited by the EJB specification. However, there has naturally been (and still is) a common need for coordinating work in JEE (something that was partly, but not completely solved in a very verbose way with Message-Driven Beans (MDB)). This need was the main motivation why IBM and BEA decided to do a joint specification that solves this problem by providing a standardized way of executing concurrent tasks in a JEE environment. The specification is called CommonJ and has support for the Master/Worker pattern in its WorkManager API.
From BEA's documentation about the specification:
"The Work Manager provides a simple API for application-server-supported concurrent execution of work items. This enables J2EE-based applications (including Servlets and EJBs) to schedule work items for concurrent execution, which will provide greater throughput and increased response time. After an application submits work items to a Work Manager for concurrent execution, the application can gather the results. The Work Manager provides common "join" operations, such as waiting for any or all work items to complete. The Work Manager for Application Servers specification provides an application-server-supported alternative to using lower-level threading APIs, which are inappropriate for use in managed environments such as Servlets and EJBs, as well as being too difficult to use for most applications."
What is interesting is that specification not only defines an API for submitting work and getting the result back, but it also provides functionality for tracking work status in that it allows you to register listener that will receive callback events whenever the state of the work has been changed. This makes it possible to for example not only detect work failure but also the reason why it failed, which gives us a possibility of restarting the work.
Each of the three approaches we have looked at so far have been built upon the previous one and has gradually raised the abstraction level. But even more importantly minimized and simplified the code that we as users have to write and maintain.
The most natural choice is to base our implementation on the CommonJ WorkManager specification. It is a simple and minimalistic specification and seems to provide everything that we need in order to build a reliable Master/Worker container.
Here are the interfaces in the specification:
The plan is that we will first implement the specification as a regular multi-threaded application using the java.util.concurrent abstractions (in particular the ExecutorService and the LinkedBlockingQueue classes) and then use Open Terracotta to declaratively (and transparently), turn it into a multi-JVM, distributed implementation.
Creating the Master (WorkManager)
We start with creating the SingleQueueWorkManager, which serves as our Master. It implements the CommonJ WorkManager interface which provides the API that the user uses to schedule the Work and wait for the Work to be completed. It has a reference to the work queue that is shared between the Master and the Workers, in this case represented by the SingleWorkQueue abstraction.
Here is how we could implement the work manager:
Creating the Shared Queue
The SingleQueueWorkManager schedules work by adding work to the SingleWorkQueue. The SingleWorkQueue is the artifact that has state that needs to be shared between the Master and the Workers, since it holds the queue with all the pending Work. We need to have a single instance of this queue that can be available to both the Master and all its Workers.
The work queue can be implemented like this:
As you can see, it is a simple wrapper around a java.util.concurrent.BlockingQueue queue and has three methods; take(), put() and peek() which simply delegates to the BlockingQueue but also adds error handling in that it sets the status for the Work in case of failure.
Creating the Worker
Finally, we have the Worker, in our case represented by the SingleQueueWorker abstraction. This class uses a thread pool to spawn up N number of worker threads that continuously grabs and executes Work from the SingleWorkQueue. During the processing of the Work, the status flag in the wrapping WorkItem is maintained (can be one of either ACCEPTED, STARTED, COMPLETED or REJECTED). This is needed in order for the SingleQueueWorkManager to be able to continuously monitor the status of the Work it has scheduled.
This is what a Worker implementation can look like. As you can see we choose to make use of the ExecutorService thread pool implementation in the java.util.concurrent package:
What about my Work?
Now we have the three main artifacts in place; the Master, the Worker and the Shared Queue. But what about this Work abstraction and which role does the DefaultWorkItem play?
Starting with the Work abstraction. It is an interface in the CommonJ WorkManager specification but is an interface that we cannot implement generically in the WorkManager "container" we are building now, but should be implemented by the user of this container since the implementation of the actual work that is supposed to be done is of course use-case specific.
The DefaultWorkItem is an implementation of the WorkItem interface in the specification. It's purpose is simply to wrap a Work instance and provide additional information, such as status and an optional WorkListener and can look something like this:
A WorkListener is a listener that the user can register in order to track the status of the work and can upon reception of a state-changed-event take proper action (like retry upon failure etc.)
That's it, now we now have a fully functional single-JVM, multi-threaded, implementation of the CommonJ WorkManager specification. Even thought this could be useful as is, there is no way we can scale it out with the needs of the application, since we are bound by the computing power in the single box we are using. Plus, it is not topic for the article. So, let's make it a bit more interesting.
Introducing Open Terracotta
Open Terracotta is a product that delivers JVM-level clustering as a runtime infrastructure service. It is Open Source and available under a Mozilla-based license.
Terracotta uses bytecode instrumentation to adapt the target application at class load time. In this phase it extends the application in order to ensure that the semantics of the Java Language Specification (JLS) are correctly maintained across the cluster, including object references, thread coordination, garbage collection etc.
Another important thing to mention is that Terracotta does not use Java Serialization, which means that any POJO can be shared across the cluster. What this also means is that Terracotta is not sending the whole object graph for the POJO state to all nodes but breaks down the graph into pure data and is only sending the actual "delta" over the wire, meaning the actual changes - the data that is "stale" on the other node(s).
Terracotta is using an architecture known as hub-and-spoke, which means that it has one central L2 server and N number of L1 clients. (The L1 client is the Terracotta client JARs running inside the target JVM, while the L2 server is the central Terracotta server. L1 and L2 refers to first and second level caches.)
This might seem strange, since most clustering solutions on the market today are using peer-to-peer, but as we will see, hub-and-spoke has some advantages and makes it possible to do some very interesting optimizations. The server plays two different roles:
- First it serves as the coordinator ("the traffic cop") in the cluster. It keeps track of things like; which thread in which node is holding which lock, which nodes are referencing which part of the shared state, which objects have not been used for a specific time period and can be paged out, etc. Keeping all this knowledge in one single place is very valuable and allows for very interesting optimizations.
- Second, it serves as a dedicated state Service of Record (SoR), meaning that it stores all the shared state in the cluster. The state server does not know anything about Java, but only stores the bytes of the data that has changed plus a minimal set of meta info. The L2 server itself is clusterable through a SAN-based failover mechanism. This means that it is possible to scale-out the L2 server in the same fashion as most peer-to-peer solutions but with the advantage of keeping the L2 separate from the L1. This separation allows us to scale out the L1 clients and the L2 servers independently of each other, which is the way that the Internet scales.
One way of looking at Terracotta is to see it as Network Attached Memory. Network Attached Memory (NAM) is similar to Network Attached Storage (NAS) in the sense that JVM-level (heap-level) replication is making NAM's presence transparent just like NAS can be transparent behind a file I/O API. Getting NAM to perform and scale is similar to any I/O platform; read-ahead buffering, read/write locking optimizations etc.
Even though it is in clustering, meaning scalability and high-availability, of Web and enterprise applications, that Terracotta can bring its most immediate value, it is really a platform for solving generic distributed computing and shared memory problems in plain Java code. This is something that makes it applicable to a wide range problem domains, for example building a POJO-based Data Grid.
Let's cluster it using Open Terracotta!
I know what you are thinking:
"Clustering, hmmm... Now comes the hard part right?"
It turns out that in order to cluster our current WorkManager implementation, all we have to do is to create a Terracotta configuration file in which we define three things:
- The fully qualified name of the top level object in the object graph that we want to share. In our case we want to share the work queue. This means that the most natural root would be the LinkedBlockingQueue in the SingleWorkQueue class, e.g. the m_workQueue field.
- The classes that we want to include for instrumentation. We can include all classes for instrumentation, e.g. use a "match-all" pattern, but it is usually better to narrow down the scope of the classes that Terracotta needs to introspect.
- The potential lock boundaries that we want Terracotta to introspect. These are called auto-locks and is simply a hint that Terracotta should treat the synchronized blocks in these places as transaction boundaries. (You can also define explicit locks called named-locks.) In our case we will define a "match-all" pattern for the locking, something that works fine in most cases and should be treated as the default.
Now we have a distributed/multi-JVM CommonJ WorkManager ready for use. But in order to call it a POJO-based Data Grid, we need extend it a bit and address some challenges that are likely to arise if we were to deploy this implementation into production.
Part 3: Getting serious - Handling real-world challenges
Now we have learned the basics of Data Grids, the Master/Worker pattern and what the CommonJ WorkManager specification is all about. We also walked you through how to implement a distributed CommonJ WorkManager by first creating a single-JVM implementation that we then cluster into a distributed multi-JVM implementation with Open Terracotta.
However, it was a fairly simple and in some sense naïve implementation. This was a good exercise in terms of education and understanding, but in order to use the implementation in the real world we need to know how to address some of the challenges is that might come up. Now we will discuss some of these challenges and how we can extend the initial implementation to address them.
The challenges that we will look at, one by one, are how to handle:
Dealing with very high volumes of data
Our current implementation has one single queue that is shared by the master(s) and the all workers. This usually gives acceptable performance and scalability when used with a moderate work load. However, if we need to deal with very high volumes of data then it is likely that we will bottleneck on the single queue. So how can we address this in the most simple fashion?
The perhaps simplest solution is to create one queue per worker, and have the master do some more or less intelligent load-balancing. If we are able to do a good partition of the work and data, that we discussed in the previous section (Scaling Data Grids), then this solution is one that will:
- Maximize the use of Locality of Reference - since all work that are operating on the same data set will be routed to the same queue with the same worker working on them
- Minimize contention - since since there will only be one reader and one writer per queue.
If we take a look at the current code, what needs to be changed is to first change the SingleWorkQueue class to a WorkQueueManager class and swap the single LinkedBlockingQueue to a ConcurrentHashMap with entries containing a routing ID mapped to a LinkedBlockingQueue:
Swap the wrapped LinkedBlockingQueue:
To a ConcurrentHashMap of LinkedBlockingQueue's :
We also need to change the WorkManager implementation and rewrite the schedule(..) methods to use a Router abstraction and not put the work directly into the queue:
Change the schedule(..) method from:
To redirect to the Router abstraction (more on the Router below):
In the last code block we could see that we have introduced a new abstraction; the Router, which leads us to the topic of how to deal with different routing schemes.
Dealing with different routing schemes
By splitting up the single working queue into multiple queues each with a unique routing ID we are opening up for the possibility of providing different routing schemes that can be customized to address specific use cases.
As in the previous section, we have to make some changes to the initial single queue implementation. First we need to add a routing id to the WorkItem abstraction, we do that by adding the generic type ID and let the WorkItem implement the Routable interface:
As we saw in the previous section, we also introduce a new abstraction called Router:
This abstraction can be used to implement various load-balancing algorithms, such as for example:
- Round-robin - Router circles around all queues one by one
- Work load sensitive balancing - Router looks at queue depth and always sends the next pending work to the shortest queue
- Data affinity - "Sticky routing", meaning that the Router sends all pending work of a specific type to a specific queue
- Roll your own - to maximize Locality of Reference for your specific requirements
In our Router implementation we have three default algorithms; round-robin, load-balancing and single queue. You can find them as static inner classes in the Router interface.
Here is an example of the load-balancing router implementation that takes an array of the routing IDs that are registered and always sends the next pending work to the shortest queue:
Dealing with work failure
As we discussed earlier in this article the CommonJ WorkManager specification provides APIs for event-based failure reporting and tracking of work status. Each Work instance is wrapped in a WorkItem instance which provides status information. It also gives us the possibility of defining an optional WorkListener through which we will get callback events whenever the status of the work has been changed.
The WorkListener interface looks like this:
As you can see, we can implement callbacks methods that subscribes to events triggered by work being accepted, rejected, started and completed. In this particular case we are mainly interested in doing something when receiving the work rejected event. In order to do that to we simply need to create a implementation of the WorkListener interface and add some code in the workRejected method:
Dealing with ordering of work
In some situations it might be important to define and maintain ordering of the work. This doesn't seem to be a problem if we have a single instance of the master, but imagine scaling out the master; then it immediately gets worse.
So, how can we maintain ordering of the work? Since the internal implementation is based on POJOs and everything is open and customizable, it is actually very easy to implement support for ordering. All we need to do is to swap our map of LinkedBlockingQueue<T>(s) to a map of PriorityBlockingQueue<T> (s).
Then you can let your Work implement Comparable and create a custom Comparator<T> that you pass it into the constructor of the PriorityBlockingQueue<T>:
If you need to maintain ordering of the result then you have to do something similar with the references to the WorkItem(s) that you get when invoking schedule(..) on the WorkManager instance.
Dealing with worker failure
In a reliable distributed system it is important to be able to detect if a worker goes down. (In order to simplify the discussion, let's define a worker as being a JVM.) There are some different ways this could be implemented and we will now discuss some of the different strategies that we can take.
- Heartbeat mechanism - In this strategy each worker has an open connection to the master through which it continuously (and periodically) sends a heartbeat (some sort of "I'm-alive" event) to the master. The master can then detect if the heartbeat for a specific worker stops and can after a certain time period consider the node to be dead. This is for example the way that Google's MapReduce implementation detects worker failure.
- Work timestamp - Here we are adding a timestamp to each pending work item. The master can then periodically peek into the head of each work queue, read the timestamp and and match it to a predefined timeout interval. If it detects that a work item has timed out then it can consider the worker(s) that is polling from the specific queue to be dead.
- Worker "is-alive-lock" - This strategy utilizes the cross-JVM thread coordination that Terracotta enables. The first thing that each worker does when it starts up is spawn a thread that takes a worker specific lock: Then the the worker connects to the master which spawns up a thread that tries to take the exact same lock. Here comes the trick - the master thread will block until the lock is released, something that will only happen if the worker dies:
- Terracotta Server notifications - Since detection of a node failure is such a common problem, an implementation in which the Terracotta clients can subscribe on notifications of node death from the Terracotta Server is underway and will be released in an upcoming version of Open Terracotta.
In all of the strategies we need to take proper action when worker failure has been detected. In our case it would (among other things) mean to reroute all non-completed work to another queue.
Rewrite the Terracotta config
As you perhaps remember from section Very high volumes of data?, in order to make the implementation more scalable we had to change the SingleWorkQueue class to a WorkQueueManager class and swap the single LinkedBlockingQueue to a ConcurrentHashMap with entries containing a routing ID mapped to a LinkedBlockingQueue. When we did that we also changed the name of the field and the name of class holding this field, this is something that needs to be reflected in the Terracotta configuration file:
How to use the POJO Grid?
Here is an example of the usage of the WorkManager, WorkQueueManager, Router and WorkListener abstractions:
To start up a Worker you simply have to create an instance of the Worker, pass in a reference to the DefaultWorkQueueManager and the routing id to use, and finally invoke start():
Don't we need a result queue?
There is no need for a result queue, since the Master is holding on to the WorkItem e.g. the pending work (the work that is in progress) including its result. Terracotta maintains Java's pass-by-reference semantics, the regular (local) reference to the WorkItem that the Master holds, will work transparently across the cluster. This means that a Worker can update the exact same WorkItem and the Master would know about it immediately.
The client usage would roughly be to start up one WorkManager and N number of Workers, each one on a different JVM. Before you start up the WorkManager and Workers you have to enable the Terracotta runtime.
There are two ways you can do this:
Use Terracotta wrapper script
Open Terracotta ships with a startup script that should be used as a drop-in replacement to the regular java command. The name of the script is dso-java and can be found in the bin directory in the distribution. To use the recommended Terracotta startup script:
- Find the dso-java script in the dso/bin directory.
- Replace the regular invocation to java with the invocation to dso-java.
Use JVM options
You can also use additional JVM options for applications inside the web container, perform the following:
- Prepend the Terracotta Boot Jar to the Java bootclasspath.
- Define the path to the Terracotta configuration file.
Here is an example (assuming you are running Windows):
Now we are almost done, but before you spawn up the Master and the Workers we must first start the Terracotta Server. This is done by invoking the start-tc-server script in the dso/bin directory. After you have done that then you can just start up the Master and the Workers (in any order you like).
That is all there is to it.
Benefits of implementing a POJO-based Data Grid
So let's wrap up with a brief discussion of the most immediate benefits of building and using a POJO-based Data Grid. Here are some of the benefits that we value:
- Work with POJOs - You get to work with POJOs, meaning simple and plain Java code. Any POJO can be shared across the cluster and migrated between worker nodes. There is no need for implementing Serializable or any other interface. It is as simple as Java can be.
- Event-driven development - The Master/Worker (and CommonJ WorkManager) pattern leans itself towards event-driven development with no need for explicit threading and guarding. This can simplify the user code immensely compared to the use of explicit thread coordination.
- Simpler development and testing - We have talked about how Open Terracotta allows you to develop an application for a single JVM and then simply cluster it in order to run it on multiple JVMs. This means that you can simplify the development and testing of your Data Grid application by doing the development and testing on your workstation, utilizing multiple threads instead of JVMs and then deploy the application onto multiple nodes at a later stage.
- White Box container implementation - The whole grid implementation is "under your fingers", nothing is really abstracted away from you as a developer. This gives you the freedom to design Master, Worker, routing algorithms, fail-over schemes etc. the way you need, you can customize everything as much as you want.
TODO: update with SVN access
- Download the distribution for the POJO-based Data Grid (Maven 2 project that includes a sample implementation of a distributed web spider):
- Download Terracotta's JVM-clustering technology - Open Source:
- Tutorial that is using the implementation outlined in this article to build a parallel web spider (that is part of the pojo-grid distribution):
- Article - Distributed Computing Made Easy:
- Article - Stateful Session Clustering: Have Your Availability and Scale It Too:
- Documentation for Terracotta's JVM-clustering technology:
- Author's weblog: