Controlling a Pipeline
Whether a pipeline is running locally, in the cloud, or on a teammate's machine, conducto.com/app is the place to view and control it. This page will show you how to use the Conducto web app to control pipelines.
This page is mostly about what to do once you already have a pipeline definition. If you want to know about creating pipeline definitions, check out Pipeline Structure. It also assumes that you have Conducto (and its dependencies) installed. If you're not sure about this, head over to Getting Started.
We'll start with a definition containing three nodes that do the same job in different ways. Then we'll create a pipeline from that definition and modify its node parameters. Our modifications will change which node completes fastest.
Our example pipelines are on GitHub. This page uses the one called "Compression Race".
The commands below have access to GNU Parallel, GZIP, and payload files containing between 50 MB and 250 MB of noise. It's not necessary, but if you're curious about how these things got there, feel free to skip ahead to the next section where you'll learn about Dockerfiles.
Let's clone it and look at its structure:
git clone https://github.com/conducto/examples
cd examples/compression_race
python pipeline.py

/
├─0 container parallelism
│   ├─ worker 0    gzip -1 payload1.dat
│   ├─ worker 1    gzip -1 payload2.dat
│   ├─ worker 2    gzip -1 payload3.dat
│   ├─ worker 3    gzip -1 payload4.dat
│   └─ worker 4    gzip -1 payload5.dat
├─1 process parallelism    ls payload*.dat | parallel gzip -1
└─2 no parallelism    gzip -1 payload*.dat
This pipeline encapsulates three strategies for compressing five files:
- In five containers
- In a single container, with five processes in parallel
- In a single container, sequentially in a single process
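To make the difference between the last two strategies concrete, here's a standard-library-only sketch that mimics them on small, made-up payloads (the file names and 1 MB sizes here are illustrative; the real pipeline runs `gzip -1` on 50 MB to 250 MB files):

```python
import gzip
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

def compress(path: str) -> str:
    # Roughly what `gzip -1 payload.dat` does: fastest, least-thorough level
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb", compresslevel=1) as dst:
        dst.write(src.read())
    return path + ".gz"

workdir = tempfile.mkdtemp()
payloads = []
for i in range(1, 6):
    path = os.path.join(workdir, f"payload{i}.dat")
    with open(path, "wb") as f:
        f.write(os.urandom(1 << 20))  # 1 MB of noise stands in for the real payloads
    payloads.append(path)

# "no parallelism": one worker compresses each file in turn
start = time.perf_counter()
for p in payloads:
    compress(p)
t_sequential = time.perf_counter() - start

# "process parallelism": five workers at once (threads suffice in this
# sketch, since zlib releases the GIL while it compresses)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(compress, payloads))
t_parallel = time.perf_counter() - start

print(f"sequential: {t_sequential:.3f}s  parallel: {t_parallel:.3f}s")
```

On a multi-core machine the parallel pass usually finishes first, just as the "process parallelism" node tends to beat "no parallelism" in the race.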
We can create a pipeline from this definition by running the pipeline.py script and passing --local to indicate our local Docker daemon.
python pipeline.py --local
Conducto will look at the definition and create a pipeline with a random name (mine was 'bei-hou').
Then it will direct you to view the pipeline in your browser at a conducto.com/app URL.
Initially, all nodes will be pending, and the 'Run' toggle will be unset.
The node parameters you see on the right belong to the currently selected node. This pipeline is set up like a race--you can click the pencil icons to change how many cpu cores each node has access to, which will change who wins.
If you watch the output of docker ps and click 'Run' in your browser, you can see Conducto run the pipeline in a series of containers.
watch docker ps

CONTAINER ID    COMMAND                   NAMES
ad0649b65166    "sh -c '/tmp/conduct…"    conducto_worker_bei-hou
8325fa119ab6    "sh -c '/tmp/conduct…"    conducto_worker_bei-hou
09d2c5cc08c0    "python -m manager.s…"    conducto_manager_bei-hou
The Conducto web app will display the results of the run as they become available.
Ideally you'll see several green checkmarks, which mean that the command run by each node returned 0. Click on a node to the left to see how long it ran before completing.
Since we set the node parameter
cpu=2 on the root node, each node gets access to two cpu cores.
So if your machine has ten cores available, then "container parallelism" was probably the fastest racer--since each of the five containers had two processor cores to itself.
If fewer cores were available, some children of the
Parallel node might not have started until others completed and freed up a core.
In this case, maybe "process parallelism" won the race.
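The scheduling effect is simple arithmetic: if each child asks for two cores, the number of children that can run at once is the available cores divided by two, and the five workers execute in "waves". A small sketch of that math (the real scheduler is more sophisticated; this just illustrates the core idea):

```python
import math

def waves(n_children: int, total_cores: int, cores_per_child: int = 2) -> int:
    # How many children can hold their requested cores at the same time
    at_once = max(1, total_cores // cores_per_child)
    # Children that can't get cores wait for an earlier wave to finish
    return math.ceil(n_children / at_once)

assert waves(5, total_cores=10) == 1  # all five containers run together
assert waves(5, total_cores=4) == 3   # two at a time: 2 + 2 + 1
assert waves(5, total_cores=2) == 5   # fully serialized
```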
This wasn't really a fair race--one strategy used ten cpu cores (two per container), but the other had to do all the work with just two cores. Let's modify the underpowered node to give it ten cores also:
Having already run this pipeline once, we have most of the results we need, so we'll only be rerunning one node. But first, we need to correct the number of cpu cores it has access to.
Since the Run toggle is enabled, the node will automatically rerun as soon as you reset it. When it completes, you can compare the times to see how much faster it was.
It takes longer to start a container than it does to start a process, so it's not surprising that five parallel processes are faster than five parallel containers. As the job grows in complexity, however, the additional transparency and control that Conducto provides quickly becomes worth it.
For instance, you can expand the
container parallelism node, skip some of its children, and reset it.
Conducto keeps the data from previous runs around for comparison, so you can see how much the skipped children affected run time.
You can also modify a command, reset the node, and see which execution took longer or used more memory.
Clicking on the various entries in the timeline shows how the timestamped command differs from the current value.
When you're done interacting with this pipeline, you can put it to sleep.
The "Pipelines" tab at the top will let you see your sleeping pipelines. By default, they stick around for seven days; during that window you can wake them up and continue tinkering.
So far we have been using conducto.com/app to manipulate a pipeline whose nodes were executing commands in containers on our local machine. This can be a powerful arrangement because you can control that computation from anywhere. Since it's your machine doing the heavy lifting, we make this workflow available for free.
You might find it convenient to let us handle the computation too.
Your organization will need to have billing set up for this, but you can launch pipelines with
--cloud and Conducto will handle the resources for you.
My machine doesn't have 50 processor cores, but cloud mode handles that sort of thing.
If you want to play with a variable number of workers like this, you'll find scale.py in the same directory as pipeline.py in the examples repo. Unlike pipeline.py, it accepts a number of workers from the command line:
python scale.py race 50 --cloud
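We haven't reproduced scale.py here, but the pattern it uses--building a variable number of worker commands from a command-line argument--might look something like this (the function and argument names below are illustrative, not the real script's):

```python
import argparse

def worker_commands(n_workers: int) -> list:
    # One compression command per payload, mirroring the race's worker nodes
    return [f"gzip -1 payload{i}.dat" for i in range(1, n_workers + 1)]

def parse_args(argv):
    parser = argparse.ArgumentParser(description="compression race, scaled")
    parser.add_argument("workers", type=int, help="number of worker nodes")
    return parser.parse_args(argv)

# Stands in for the `50` in `python scale.py race 50 --cloud`
args = parse_args(["50"])
commands = worker_commands(args.workers)
print(len(commands))
```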
In the sections above, we explored the ways that you can take an existing pipeline and modify how much computing power each node has access to. We used the Conducto web app to tweak and rerun certain nodes, and to compare results between runs. We also learned about local mode and cloud mode.
There are many other node parameters besides
cpu that you can use to control pipelines, but working with them is similar.
They are documented in a later section, but if you followed along here, you're probably already prepared to play around with them.
You can find a list of them on the Node base class.
The examples in this section used a custom image so that tools like
parallel were where we needed them.
Doing so is a good way to keep the commands in your Exec nodes simple.
In the Images section, you'll learn how to customize images for your own pipelines.