SEEP - Real-Time Stateful Big Data Processing

Overview

SEEP is an experimental parallel data processing system developed by the Large-Scale Distributed Systems (LSDS) research group (http://lsds.doc.ic.ac.uk) at Imperial College London. It is licensed under the Eclipse Public License (EPL).

The SEEP system is under heavy development and should be considered an alpha release. This is not considered a "stable" branch.

Further details on SEEP, including papers that explain the underlying model, can be found at the project website: http://lsds.doc.ic.ac.uk/projects/SEEP

The SEEP system consists of two modules: the runtime system (seep-system) and a compiler (java2sdg). The sections below describe how to build and run both modules.

Building it

The project follows the standard Maven directory structure, with two differentiated modules, seep-system and seep-java2sdg.

There are two options to build the SEEP system:

Option 1, single jar (recommended) -- run:

mvn clean compile assembly:single

This produces one jar with all dependencies included.

Option 2, without dependencies -- run:

mvn -DskipTests package

In this case, ensure that the classpath includes the dependencies.
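
As a minimal sketch of what putting the dependencies on the classpath can look like, the helper below joins every jar in a directory into a colon-separated classpath string. The function name build_classpath is an assumption made for illustration and is not part of SEEP; SEEP does not prescribe where the dependency jars live.

```shell
# build_classpath is a hypothetical helper, not part of SEEP.
# It joins every jar in a directory into a colon-separated classpath.
build_classpath() (
    dir=$1
    classpath=""
    for jar in "$dir"/*.jar; do
        # Skip the literal pattern when the directory contains no jars.
        [ -e "$jar" ] || continue
        # Prefix with "<existing>:" only when something is already collected.
        classpath="${classpath:+$classpath:}$jar"
    done
    printf '%s\n' "$classpath"
)
```

For example, if the dependency jars have been collected in one directory (Maven's dependency:copy-dependencies goal places them in target/dependency by default), the output of build_classpath for that directory can be passed to java through -cp together with the SEEP jar. The main class to launch is not named in this README, so it is left out here.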

Alternatively, you can build a single module by running either of the commands above inside seep-system or seep-java2sdg.

Running it

Seep System

The system requires one master node and N worker nodes (one worker node per Operator).

First set the IP address of the master node in "mainAddr" inside config.properties and build the SEEP system.
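
The edit to config.properties can also be scripted. The following is a sketch under assumptions: set_property is a hypothetical helper (not shipped with SEEP), it only handles simple "key = value" lines, and the address in the usage comment is a made-up example. Only the key name mainAddr comes from this README.

```shell
# set_property is a hypothetical helper, not part of SEEP.
# It replaces (or adds) a "key = value" line in a properties file.
# Note: the updated key is written at the end of the file.
set_property() (
    file=$1 key=$2 value=$3
    tmp=$(mktemp)
    # Keep every line except an existing definition of the key.
    grep -v "^[[:space:]]*$key[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the original file permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
)

# Example (not executed here; 192.168.0.10 is a placeholder address):
#   set_property seep-system/src/main/resources/config.properties mainAddr 192.168.0.10
```

Because the file lives under src/main/resources, the system has to be rebuilt after the change for the new value to end up in the jar.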

Next run the master in the designated node:

java -jar <system.jar> Master <query.jar> <Base-class>

where query.jar is the compiled query and the last parameter is the name of the base class, not a path.

Finally run as many worker nodes as your query requires:

java -jar <system.jar> Worker

Local mode:

To run the SEEP system on a single machine, pass a different port to each Worker node:

java -jar <system.jar> Worker <port>
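
When several workers are needed locally, the commands differ only in the port. The sketch below prints one launch command per worker with consecutive ports. The helper name worker_commands and the jar name seep-system.jar are assumptions for illustration; the helper only prints the commands, so they can be reviewed before being run, each in its own terminal.

```shell
# worker_commands is a hypothetical helper, not part of SEEP.
# It prints one Worker launch command per port, starting at base_port.
worker_commands() (
    jar=$1 base_port=$2 count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'java -jar %s Worker %s\n' "$jar" "$((base_port + i))"
        i=$((i + 1))
    done
)

# seep-system.jar stands in for <system.jar>.
worker_commands seep-system.jar 3501 3
```

This prints the Worker command three times, with ports 3501, 3502 and 3503.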

Seep Java2sdg

You must specify an input program, a target (dot or seepjar), an output file name, and the classpath of the driver program and its dependencies. Example:

java -jar <java2sdg.jar> -i Driver -t dot -o myOutput -cp examples/

The above command processes the input program "Driver", using the dependencies in "examples/", and generates the output file "myOutput.dot".
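
Since all four arguments are mandatory, a small wrapper can catch a missing argument or a mistyped target before the JVM starts. This is only a sketch: java2sdg_command is a hypothetical helper, and it prints the command instead of running it. The flags -i, -t, -o and -cp and the targets dot and seepjar are the ones shown in this README.

```shell
# java2sdg_command is a hypothetical wrapper, not part of SEEP.
# Arguments: <java2sdg.jar> <input> <target> <output> <classpath>
java2sdg_command() {
    if [ "$#" -ne 5 ]; then
        echo "usage: java2sdg_command <java2sdg.jar> <input> <target> <output> <classpath>" >&2
        return 2
    fi
    # Only the two targets named in the README are accepted.
    case $3 in
        dot|seepjar) ;;
        *) echo "unknown target: $3 (expected dot or seepjar)" >&2; return 2 ;;
    esac
    printf 'java -jar %s -i %s -t %s -o %s -cp %s\n' "$1" "$2" "$3" "$4" "$5"
}

# java2sdg.jar stands in for <java2sdg.jar>.
java2sdg_command java2sdg.jar Driver dot myOutput examples/
```

The call at the end prints the same command line as the example above, with java2sdg.jar in place of <java2sdg.jar>.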

Mini Tutorial

This tutorial outlines how to run a simple example application in SEEP locally on a single machine, starting from a fresh clone of the source code. Below, we assume that <SEEP Dir> refers to the name of the directory into which the SEEP code has been cloned and <Path to SEEP Dir> is the absolute path of its parent directory, so that the clone is located at <Path to SEEP Dir>/<SEEP Dir>.

1) We inspect the application and configure the system.

To get an overview of the example application, open <SEEP Dir>/seep-system/examples/stateless-simple-query/src/Base.java in your favorite editor and observe the following:

- The method compose defines the application and returns a QueryPlan.
- The application contains three operators. Each is a Connectable that is instantiated with a different functionality, given as a class in the constructor of Connectable.
- For each Connectable, the input schema has to be provided as a list of String attributes.
- The created Connectable objects are then connected to build the actual data flow graph, which in this example is the sequence Source -> Processor -> Sink.
- The actual operator implementations can be found in the classes Source, Processor, and Sink in the package operators.

For the configuration, open <SEEP Dir>/seep-system/src/main/resources/config.properties in your favorite editor. Assuming that the system should run locally, make sure that the file states:

mainAddr = 127.0.0.1
mainPort = 3500
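
To confirm the two values before building, the file can be checked from the shell. The sketch below assumes simple "key = value" lines. Both get_property and check_local_config are hypothetical helpers, not part of SEEP; the expected values are the ones listed above.

```shell
# get_property and check_local_config are hypothetical helpers, not part of SEEP.

# Print the value of a key from a properties file, trimming surrounding spaces.
# If the key appears more than once, the last definition wins.
get_property() (
    file=$1 key=$2
    sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$file" | tail -n 1
)

# Succeed only if the file holds the local-mode values used in this tutorial.
check_local_config() {
    [ "$(get_property "$1" mainAddr)" = "127.0.0.1" ] &&
        [ "$(get_property "$1" mainPort)" = "3500" ]
}

# Example (not executed here):
#   check_local_config seep-system/src/main/resources/config.properties && echo "config OK"
```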

2) We build the system, copy the snapshot of the system to the example application, and build the example application:

cd <Path to SEEP Dir>/<SEEP Dir>
./build.sh
mkdir seep-system/examples/stateless-simple-query/lib
cp seep-system/target/seep-system-0.0.1-SNAPSHOT.jar seep-system/examples/stateless-simple-query/lib/
cd seep-system/examples/stateless-simple-query
ant

3) We start the system locally. Open four terminals to start one Master node and three Worker nodes.

The following snippet should start the master node and bring up a terminal menu:

cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Master <Path to SEEP Dir>/<SEEP Dir>/seep-system/examples/stateless-simple-query/dist/stateless-simple-query.jar Base

Now, we start the three workers (note that each is assigned a different port, given as the last parameter):

cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Worker 3501

cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Worker 3502

cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Worker 3503

After that, the terminal running the Master node should show messages indicating that the workers have connected to the master.

In the terminal for the master, enter the following. Option 1 deploys the code to the workers and option 2 starts execution; after an additional Enter, the actual data is streamed:

1
2
<Enter>

Now, one of the terminals running a worker node should print out a statement starting with "SNK" roughly every second. This worker runs the Sink operator and the output indicates the time elapsed (first number, in seconds) and the number of tuples received (second number).