SEEP is an experimental parallel data processing system that is being developed by the Large-Scale Distributed Systems (LSDS) research group (http://lsds.doc.ic.ac.uk) at Imperial College London. It is licensed under EPL (Eclipse Public License).
The SEEP system is under heavy development and should be considered an alpha release. This is not considered a "stable" branch.
Further details on SEEP, including papers that explain the underlying model, can be found at the project website: http://lsds.doc.ic.ac.uk/projects/SEEP
The SEEP system consists of two modules, the runtime system (seep-system) and a compiler (java2sdg). Below is some information regarding how to build the system and modules.
The project follows the standard Maven directory structure, with two differentiated modules, seep-system and seep-java2sdg.
There are two options to build the SEEP system:
Option 1, single jar (recommended) -- run:
mvn clean compile assembly:single
This produces one jar with all dependencies included.
Option 2, without dependencies -- run:
mvn -DskipTests package
In this case, ensure that the classpath includes the dependencies.
You can alternatively build only individual modules, by running the same options above inside seep-system or seep-java2sdg, respectively.
The system requires one master node and N worker nodes (one worker node per Operator).
First set the IP address of the master node in "mainAddr" inside config.properties and build the SEEP system.
Next run the master in the designated node:
java -jar <system.jar> Master <query.jar> <Base-class>
where query.jar is the compiled query and the last parameter is the name of the base class, not a path.
Finally run as many worker nodes as your query requires:
java -jar <system.jar> Worker
To run the SEEP system on a single local machine, append a different port to each Worker node:
java -jar <system.jar> Worker <port>
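On a single machine the worker commands differ only in the port, so they can be generated rather than typed by hand. The following is a minimal sketch and not part of SEEP: the function name `seep_worker_cmds` and the `SEEP_JAR` variable are hypothetical, and the helper only prints the commands so that they can be reviewed and then run in separate terminals.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SEEP): print one Worker launch
# command per port, counting up from a base port.
# Usage: seep_worker_cmds <system.jar> <base-port> <count>
seep_worker_cmds() {
    jar=$1
    port=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "java -jar $jar Worker $((port + i))"
        i=$((i + 1))
    done
}

# Example: commands for three workers on consecutive ports from 3501.
# SEEP_JAR is an assumed variable pointing at the built system jar.
seep_worker_cmds "${SEEP_JAR:-system.jar}" 3501 3
```

The ports only need to be distinct from each other and from the master port; consecutive numbers are used here for convenience.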
To run the java2sdg compiler, you must specify an input program, an output file name, a target (dot/seepjar), and the classpath to the driver program and its dependencies. Example:
java -jar <java2sdg.jar> -i Driver -t dot -o myOutput -cp examples/
The command above processes the input program "Driver", using the dependencies in "examples/", and generates the output file "myOutput.dot".
This tutorial outlines how to run a simple example application in SEEP locally on a single machine, starting from a fresh clone of the source code. Below, we assume that <SEEP Dir> refers to the name of the directory to which the SEEP code has been cloned and <Path to SEEP Dir> is the full absolute path to this directory.
1) We inspect the application and configure the system.
To get an overview of the example application, open <SEEP Dir>/seep-system/examples/stateless-simple-query/src/Base.java in your favorite editor and observe the following:
- the method compose defines the application and returns a QueryPlan
- the application contains three operators; each is a Connectable whose functionality is given as a class passed to the Connectable constructor
- for each Connectable, the input schema has to be provided as a list of String attributes
- the created Connectable objects are then connected to build the actual data flow graph, which, in this example, is a sequence Source -> Processor -> Sink
- the actual operator implementations can be found in the classes Source, Processor, Sink in package operators
For the configuration, open <SEEP Dir>/seep-system/src/main/resources/config.properties in your favorite editor. Assuming that the system should run locally, make sure that the file states:
mainAddr = 127.0.0.1
mainPort = 3500
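Before building, it can help to confirm that both settings are actually present in the file. The following is a minimal sketch, assuming the simple `key = value` layout shown above; the function name `get_prop` and the `CONFIG` variable are hypothetical, and the parsing does not handle escapes or line continuations.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key from a simple
# "key = value" properties file (first match only).
get_prop() {
    file=$1
    key=$2
    # Match the key, optional spaces, "=", then capture the value with
    # surrounding whitespace removed.
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p" "$file" | head -n 1
}

# Path relative to the SEEP directory; override CONFIG if needed.
CONFIG=${CONFIG:-seep-system/src/main/resources/config.properties}
if [ -f "$CONFIG" ]; then
    echo "mainAddr=$(get_prop "$CONFIG" mainAddr)"
    echo "mainPort=$(get_prop "$CONFIG" mainPort)"
fi
```

Run from the SEEP directory, this prints the two values if the file exists, which makes a missing or mistyped entry easy to spot.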
2) We build the system, copy the snapshot of the system to the example application, and build the example application:
cd <Path to SEEP Dir>/<SEEP Dir>
./build.sh
mkdir seep-system/examples/stateless-simple-query/lib
cp seep-system/target/seep-0.0.1-SNAPSHOT.jar seep-system/examples/stateless-simple-query/lib/
cd seep-system/examples/stateless-simple-query
ant
3) We start the system locally. Open four terminals to start one Master node and three Worker nodes.
The following snippet should start the master node and bring up a terminal menu:
cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Master <Path to SEEP Dir>/<SEEP Dir>/seep-system/examples/stateless-simple-query/dist/stateless-simple-query.jar Base
Now, we start the three workers (note that each is assigned a different port, given as the last parameter):
cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Worker 3501
cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Worker 3502
cd <Path to SEEP Dir>/<SEEP Dir>
java -jar seep-system/target/seep-system-0.0.1-SNAPSHOT.jar Worker 3503
After that, the terminal running the Master node should show messages indicating that the workers have connected to the master.
In the terminal for the master, enter the following. Option 1 deploys the code to the workers and option 2 starts execution; after an additional Enter, the actual data is streamed:
1
2
<Enter>
Now, one of the terminals running a worker node should print out a statement starting with "SNK" roughly every second. This worker runs the Sink operator and the output indicates the time elapsed (first number, in seconds) and the number of tuples received (second number).
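If the worker output has been redirected to files instead of being watched in the terminals, the worker running the Sink can be found by searching for the "SNK" prefix. The following is a minimal sketch under that assumption; the function name `find_sink_log` and the log file names are hypothetical, and only the "SNK" line prefix comes from the description above.

```shell
#!/bin/sh
# Hypothetical helper: given worker log files, print the name of each
# file that contains at least one line starting with "SNK".
find_sink_log() {
    for f in "$@"; do
        if grep -q '^SNK' "$f"; then
            echo "$f"
        fi
    done
}

# Example (assumed file names, one log per worker):
#   find_sink_log worker-3501.log worker-3502.log worker-3503.log
```

Which worker ends up running the Sink is not stated here, so checking all logs is the safer approach.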