Controlling a Pipeline

Overview

Whether a pipeline is running locally, in the cloud, or on a teammate's machine, conducto.com/app is the place to view and control it. This page will show you how to use the Conducto web app to control pipelines.

This page is mostly about what to do once you already have a pipeline definition. If you want to know about creating pipeline definitions, check out Pipeline Structure. It also assumes that you have Conducto (and its dependencies) installed. If you're not sure about this, head over to Getting Started.

We'll start with a definition containing three nodes that do the same job in different ways. Then we'll create a pipeline from that definition and modify its node parameters. Our modifications will change which node completes fastest.

Create a Pipeline

Our example pipelines are on GitHub. This page uses the one called "Compression Race".

The commands below have access to GNU Parallel, GZIP, and payload files containing between 50 MB and 250 MB of noise. It's not necessary, but if you're curious about how these things got there, feel free to skip ahead to the next section where you'll learn about Dockerfiles.
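You don't need to recreate those payloads yourself, but if you want to experiment outside the example image, similar noise files can be generated with standard tools. This sketch uses 1 MB files to keep things quick; the files baked into the image are 50 MB to 250 MB.

```shell
# Generate noise payloads similar to the ones baked into the example image.
# Sizes here are 1 MB for speed; the real files are 50-250 MB.
for i in 1 2 3 4 5; do
  head -c 1048576 /dev/urandom > "payload$i.dat"
done
ls -l payload*.dat
```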

Let's clone it and look at its structure:

git clone https://github.com/conducto/examples
cd examples/compression_race
python pipeline.py

    /
    ├─0 container parallelism
    │ ├─ worker 0   gzip -1 payload1.dat
    │ ├─ worker 1   gzip -1 payload2.dat
    │ ├─ worker 2   gzip -1 payload3.dat
    │ ├─ worker 3   gzip -1 payload4.dat
    │ └─ worker 4   gzip -1 payload5.dat
    ├─1 process parallelism   ls payload*.dat | parallel gzip -1
    └─2 no parallelism   gzip -1 payload*.dat

This pipeline encapsulates three strategies for compressing five files:

  1. In five containers
  2. In a single container, with five processes in parallel
  3. In a single container, sequentially in a single process
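To see why strategies 2 and 3 differ, here is a small standalone sketch (not part of the example repo) that contrasts sequential compression with process-parallel compression, using Python's gzip module in place of the gzip CLI:

```python
# Standalone sketch: sequential vs. process-parallel compression.
# Uses Python's gzip module in place of the gzip CLI; unlike `gzip`,
# it leaves the original files in place.
import gzip
from multiprocessing import get_context

def compress(path: str) -> str:
    """Compress one file at level 1 (fastest), like `gzip -1`."""
    out = path + ".gz"
    with open(path, "rb") as src, gzip.open(out, "wb", compresslevel=1) as dst:
        dst.write(src.read())
    return out

def no_parallelism(paths):
    """Strategy 3: one process compresses each file in turn."""
    return [compress(p) for p in paths]

def process_parallelism(paths, workers=5):
    """Strategy 2: a pool of worker processes, like `parallel gzip -1`."""
    # The "fork" start method keeps this sketch simple; it is POSIX-only.
    with get_context("fork").Pool(workers) as pool:
        return pool.map(compress, paths)
```

With several CPU cores available, the pool finishes sooner because the files are compressed concurrently, which is the same trade-off the race in this pipeline measures.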

We can create a pipeline from this definition by running the pipeline.py script and passing --local to indicate that the containers should run on our local Docker daemon.

python pipeline.py --local

Conducto will look at the definition and create a pipeline with a random name (mine was 'bei-hou'). Then it will direct you to view it in a browser at a URL like www.conducto.com/app/p/bei-hou.

Initially, all nodes will be pending, and the 'Run' toggle will be unset.

A fresh pipeline shown via the web app

The node parameters you see on the right apply to the currently selected node. This pipeline is set up like a race--you can click the pencil icons to change how many cpu cores each node has access to, which will change which node wins.

If you watch the output of docker ps and click the 'Run' toggle in your browser, you can see Conducto run the pipeline as a series of containers.

watch docker ps

    CONTAINER ID  COMMAND                NAMES
    ad0649b65166  "sh -c '/tmp/conduct…" conducto_worker_bei-hou
    8325fa119ab6  "sh -c '/tmp/conduct…" conducto_worker_bei-hou
    09d2c5cc08c0  "python -m manager.s…" conducto_manager_bei-hou

The Conducto web app will display the results of the run as they become available.

A running pipeline shown via the web app

Ideally you'll see several green checkmarks, which indicate that the command run by each node returned 0. Click on a node on the left to see how long it ran before completing.

Initial run

Since we set the node parameter cpu=2 on the root node, each node gets access to two cpu cores. So if your machine has ten cores available, then "container parallelism" was probably the fastest racer, since each of its five containers had two cores to itself. If fewer cores were available, some children of the "container parallelism" node might not have started until others completed and freed up a core, in which case "process parallelism" may have won the race.

This wasn't really a fair race--one strategy used ten cpu cores (two per container), but the others had to do all their work with just two. Let's modify the underpowered "process parallelism" node to give it ten cores as well:

Changing Node Parameters

Having run this pipeline once, we can reuse most of the results, so we'll only be rerunning one node. But first, we need to correct the number of cpu cores it has access to.

Adding more cores to a node

Since the Run toggle is enabled, the node will automatically rerun as soon as you reset it. When it completes, you can compare the times to see how much faster it was.

Result Comparison

It takes longer to start a container than to start a process, so it's not surprising that five parallel processes are faster than five parallel containers. As the job grows in complexity, however, the additional transparency and control that Conducto provides quickly become worth it.

For instance, you can expand the container parallelism node, skip some of its children, and reset it. Conducto keeps the data from previous runs around for comparison, so you can see how much the skipped children affected run time.

Skipped Nodes

You can also modify a command, reset the node, and see which execution took longer or used more memory.

Modify Command

Clicking on the various entries in the timeline shows how the timestamped command differs from the current value.

When you're done interacting with this pipeline, you can put it to sleep.

Top Bar with Sleep Button

The "Pipelines" tab at the top will let you see your sleeping pipelines. By default, they stick around for seven days, during that window you can wake them up and continue tinkering.

Containers in the Cloud

So far we have been using conducto.com/app to manipulate a pipeline whose nodes were executing commands in containers on our local machine. This can be a powerful arrangement because you can control that computation from anywhere. Since it's your machine doing the heavy lifting, we make this workflow available for free.

You might find it convenient to let us handle the computation too. Your organization will need to have billing set up for this, but then you can launch pipelines with --cloud and Conducto will provision the resources for you.

Cloud Mode

My machine doesn't have 50 processor cores, but cloud mode handles that sort of thing.

If you want to play with a variable number of workers like this, you'll find scale.py in the same directory as pipeline.py in the examples repo. Unlike pipeline.py, it will accept a number of workers from the command line:

python scale.py race 50 --cloud
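The real scale.py lives in the examples repo; as a rough sketch of the idea (names and structure here are hypothetical, not the actual file), turning a worker count from the command line into one command per payload file might look like:

```python
# Hypothetical sketch of scaling worker commands from a count;
# the real scale.py is in the conducto/examples repo.
import sys

def worker_commands(n: int) -> list:
    """One `gzip -1` command per payload file, for n workers."""
    return [f"gzip -1 payload{i}.dat" for i in range(1, n + 1)]

if __name__ == "__main__":
    # e.g. `python scale.py race 50` -> argv[2] is the worker count
    count = int(sys.argv[2]) if len(sys.argv) > 2 and sys.argv[2].isdigit() else 5
    for command in worker_commands(count):
        print(command)
```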

Conclusion

In the sections above, we explored the ways that you can take an existing pipeline and modify how much computing power each node has access to. We used the Conducto web app to tweak and rerun certain nodes, and to compare results between runs. We also learned about local mode and cloud mode.

There are many other node parameters besides cpu that you can use to control pipelines, and working with them is similar. They are documented in a later section, but if you followed along here, you're probably already prepared to play around with them. You can find a list of them on the Node base class.

The examples in this section used a custom image so that tools like gzip and parallel were where we needed them. Doing so is a good way to keep the commands in your Exec nodes simple. In the Images section, you'll learn how to customize images for your own pipelines.

Concepts

Example Pipelines

APIs