Docker

Conducto uses Docker containers to provide portability and scalability. A Docker image is a template that packages your code with fully-defined OS and dependencies, while a container is a running instance of one.

Docker can be intimidating for newcomers and awkward for professionals, so Conducto has a number of features that simplify using it for pipelines.

Images are defined by a Dockerfile, which builds up the execution environment step by step. Detailed tutorials can be found elsewhere, but a Dockerfile commonly has a few components:

  • A base image to build on

FROM python:3.7

  • Actions to change or enhance the environment

RUN pip install pandas

  • Commands to put your own code into the container

COPY path/on/my/computer path/in/image
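
To make the three components concrete, the following minimal sketch assembles the same Dockerfile text from its parts using only the Python standard library. The helper `render_dockerfile` is hypothetical and exists only for illustration; it is not part of Conducto or Docker.

```python
from typing import Iterable


def render_dockerfile(base_image: str, run_commands: Iterable[str],
                      copy_src: str, copy_dest: str) -> str:
    """Return Dockerfile text: one FROM line, RUN lines, then one COPY line.

    Hypothetical helper for illustration only; not a Conducto API.
    """
    lines = [f"FROM {base_image}"]                     # base image to build on
    lines += [f"RUN {cmd}" for cmd in run_commands]    # environment changes
    lines.append(f"COPY {copy_src} {copy_dest}")       # your own code
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_dockerfile(
        "python:3.7",
        ["pip install pandas"],
        "path/on/my/computer",
        "path/in/image",
    )
    print(text, end="")
```

Running the script prints the three Dockerfile lines shown in the list above, in the same order.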

Image Definition

You can specify images for each node (or defaults for the entire pipeline or subtree) using the image parameter of conducto.Exec(). Here are some examples of useful images.

Extending a base image with packages and user-code

# Make a Docker Image based on python:3.7, using all the files in '.' as
# the build context, and `pip install` conducto and pandas.
import conducto as co
co.Image("python:3.7", copy_dir=".", reqs_py=["conducto", "pandas"])

Auto-building a Dockerfile

# Run `docker build` on '../../Dockerfile'.
import conducto as co
co.Image(dockerfile="../../Dockerfile")

Using a Git repo as your build context (very useful for CI/CD)

# Use the 'main' branch of Conducto's public package as your build context.
import conducto as co
co.Image("python:3.7", copy_branch="main",
         copy_url="https://github.com/conducto/conducto.git")

class conducto.Image(image=None, *, dockerfile=None, docker_build_args=None, context=None, copy_dir=None, copy_url=None, copy_branch=None, docker_auto_workdir=True, reqs_py=None, path_map=None, name=None, pre_built=False)
Parameters
  • image – Specify the base image to start from. Code can be added with the copy_* parameters, and packages with the reqs_* parameters.

  • dockerfile – Use instead of image and pass a path to a Dockerfile. Relative paths are evaluated starting from the file where this code is written. Unless otherwise specified, the directory containing the Dockerfile is used as the build context.

  • docker_build_args – Dict mapping build argument names to values, passed to docker build as --build-arg.

  • docker_auto_workdir – bool, default True. Set the working directory to the destination of copy_dir.

  • context – Use this to specify a custom docker build context when using dockerfile.

  • copy_dir – Path to a directory. All files in that directory (and its subdirectories) will be copied into the generated Docker image.

  • copy_url – URL to a Git repo. Conducto will clone it and copy its contents into the generated Docker image. Authenticate to private GitHub repos with a URL like https://{user}:{token}@github.com/…. See secrets for more info on how to store this securely. Must also specify copy_branch.

  • copy_branch – A specific branch name to clone. Required if using copy_url.

  • path_map – Dict that maps external_path to internal_path. Needed for live debug and conducto.Lazy(). It can be inferred from copy_dir; if not using that, you must specify path_map.

  • reqs_py – List of Python packages for Conducto to pip install into the generated Docker image.

  • name – Name this Image so other Nodes can reference it by name. If no name is given, one will automatically be generated from a list of our favorite Pokemon. I choose you, angry-bulbasaur!
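
As a rough illustration of how dockerfile, context, and docker_build_args relate to a plain Docker invocation, the sketch below builds the equivalent `docker build` argument list. How Conducto invokes Docker internally is not described here, so treat this as an assumption; `docker_build_command` is a hypothetical helper, while `-f`, `-t`, and `--build-arg` are standard `docker build` flags.

```python
import shlex
from typing import Dict, List, Optional


def docker_build_command(dockerfile: str, context: str,
                         build_args: Optional[Dict[str, str]] = None,
                         tag: Optional[str] = None) -> List[str]:
    """Return an argv list for `docker build`.

    Hypothetical helper: it only assembles the command and never runs Docker.
    """
    cmd = ["docker", "build", "-f", dockerfile]
    if tag:
        cmd += ["-t", tag]
    # Each dict entry becomes one `--build-arg NAME=value` pair.
    for name, value in (build_args or {}).items():
        cmd += ["--build-arg", f"{name}={value}"]
    cmd.append(context)  # the build context is the final positional argument
    return cmd


if __name__ == "__main__":
    argv = docker_build_command(
        "../../Dockerfile", "../..", {"PY_VERSION": "3.7"}, tag="my-image"
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Returning a list rather than a single string avoids quoting problems if the command is later passed to `subprocess.run`.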

Named Images

Sometimes it is useful to specify the image_name in the construction of a conducto.Node rather than the image object itself. The following code snippets are equivalent, but when using conducto.lazy_py(), it may be useful to reference the image by name.

import conducto as co
root = co.Parallel()
root.register_image(co.Image("python:3.8", copy_dir=".", name="base_python"))
root.register_image(co.Image("ruby:2.7", copy_dir=".", name="base_ruby"))
root["RunPython"] = co.Exec("python -c 'print(\"I am running in python\")'", image_name="base_python")
root["RunRuby"] = co.Exec("ruby -e 'puts \"I am doing some ruby\"'", image_name="base_ruby")

import conducto as co
root = co.Parallel()
python_image = co.Image("python:3.8", copy_dir=".")
ruby_image = co.Image("ruby:2.7", copy_dir=".")
root["RunPython"] = co.Exec("python -c 'print(\"I am running in python\")'", image=python_image)
root["RunRuby"] = co.Exec("ruby -e 'puts \"I am doing some ruby\"'", image=ruby_image)

conducto.Node.register_image(self, image: conducto.image.Image)

Register a named Image for use by descendant Nodes that specify image_name. This is especially useful with lazy pipeline creation to ensure that the correct base image is used.

Parameters

  • image – conducto.Image

Image Path Translation

The parameters copy_dir, copy_url and copy_branch take care of many of the simple cases for image path translation. If the path cannot be inferred you can declare mappings via the path_map parameter. There are two cases where path_map is helpful:

  • With copy_url and copy_branch, specify the local path of the checked-out source.

  • If the Dockerfile for the image contains a COPY line, you may wish to specify the external and internal paths to enable binding for live debug.
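
The idea behind path_map, a dict from external_path to internal_path, can be sketched as a prefix translation. This is only an illustration of the concept under the assumption that the longest matching external prefix wins; `translate_path` and the example paths are hypothetical and do not reflect Conducto's actual implementation.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def translate_path(external_path: str,
                   path_map: Dict[str, str]) -> Optional[str]:
    """Map a path on the host to the matching path inside the image.

    Hypothetical helper. Returns None when no external prefix matches.
    """
    path = PurePosixPath(external_path)
    best_len = -1
    best_path = None
    for ext_root, int_root in path_map.items():
        ext = PurePosixPath(ext_root)
        try:
            rel = path.relative_to(ext)  # raises ValueError if not a prefix
        except ValueError:
            continue
        # Assumption: prefer the most specific (longest) matching prefix.
        if len(ext.parts) > best_len:
            best_len = len(ext.parts)
            best_path = PurePosixPath(int_root) / rel
    return None if best_path is None else str(best_path)


if __name__ == "__main__":
    mapping = {"/home/me/project": "/usr/app"}  # example paths, not defaults
    print(translate_path("/home/me/project/src/main.py", mapping))
    print(translate_path("/tmp/unrelated.txt", mapping))
```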

conducto.relpath(path)

Construct a path with decoration to enable translation inside a docker image for a node. This may be used to construct path parameters to a command line tool.

This is used internally by conducto.Exec when used with a Python callable to construct the command line which executes that callable in the pipeline.

Running Exec Nodes

Each Exec node runs in a container, but multiple Exec nodes may share a single container. Conducto provides a few modes for controlling this behavior.

Default: each Exec node usually gets its own container

Normally, Conducto runs each Exec node in its own container. For efficiency, it may reuse a container: if one Exec node finishes and another node in the queue is compatible with the now-available container, Conducto will assign that queued node to the container.

If you expect each Exec node to run independently and not destructively modify the state of its container, this is a great default choice.
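
The default reuse behavior can be pictured with a toy pool. The sketch below is not Conducto's scheduler; it assumes, purely for illustration, that "compatible" means "built from the same image", and `ContainerPool` is a hypothetical class.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    container_id: int
    image: str
    busy: bool = False


class ContainerPool:
    """Toy model: reuse an idle container when its image matches."""

    def __init__(self) -> None:
        self.containers: List[Container] = []

    def acquire(self, image: str) -> Container:
        # Assumption: compatibility is decided by image name alone.
        for container in self.containers:
            if not container.busy and container.image == image:
                container.busy = True
                return container
        # No idle compatible container, so start a new one.
        container = Container(len(self.containers), image, busy=True)
        self.containers.append(container)
        return container

    def release(self, container: Container) -> None:
        container.busy = False


if __name__ == "__main__":
    pool = ContainerPool()
    first = pool.acquire("python:3.8")   # new container
    second = pool.acquire("python:3.8")  # first is busy, so another is started
    pool.release(first)                  # first Exec node finishes
    third = pool.acquire("python:3.8")   # reuses the released container
    print(first.container_id, second.container_id, third.container_id)
```

Because a reused container keeps whatever the previous node left behind, this model also shows why nodes should not destructively modify container state under the default mode.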

Run Exec nodes in a single container

Sometimes you want to build up local state over the course of a few nodes. This example starts by installing the Python redis package into the container, then uses the newly installed package to read and write data to a redis-server. These steps must all run in the same container, or else the Read and Write steps would not be able to see the redis package.

import conducto as co
with co.Serial(container_reuse_context=co.ContainerReuseContext.NEW) as test:
    test["Install"] = co.Exec("pip install redis")
    test["Write"] = co.Exec("...")
    test["Read"] = co.Exec("...")

To instruct Conducto that these nodes must share a container, create a new “same container” context: container_reuse_context=co.ContainerReuseContext.NEW. All child nodes below this that have the default of container_reuse_context=None will share this container.
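
The inheritance rule described above (children with container_reuse_context=None share the context opened by the nearest ancestor that set NEW) can be modeled with a small tree walk. This is a sketch of the stated rule only: `ToyNode`, `NEW`, and `context_owners` are hypothetical stand-ins, and the assumption that a nested NEW opens a separate context is ours.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

NEW = "NEW"  # stand-in for co.ContainerReuseContext.NEW


@dataclass
class ToyNode:
    name: str
    container_reuse_context: Optional[str] = None
    children: List["ToyNode"] = field(default_factory=list)


def context_owners(root: ToyNode) -> Dict[str, Optional[str]]:
    """Map each node name to the node that opened its shared-container context.

    None means the node is outside any shared context (default behavior).
    """
    owners: Dict[str, Optional[str]] = {}

    def visit(node: ToyNode, owner: Optional[str]) -> None:
        if node.container_reuse_context == NEW:
            owner = node.name  # this node opens a new shared context
        owners[node.name] = owner
        for child in node.children:
            visit(child, owner)  # None-valued children inherit the owner

    visit(root, None)
    return owners


if __name__ == "__main__":
    test = ToyNode("test", NEW,
                   [ToyNode("Install"), ToyNode("Write"), ToyNode("Read")])
    for name, owner in context_owners(test).items():
        print(f"{name}: shares the context opened by {owner}")
```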

Another use of ContainerReuseContext.NEW is to start a server in one Exec node, and then run a test against it in the next Exec node. Alternatively, you could put these commands in a single Exec node, connected with &&. However, separating them into multiple Exec nodes improves clarity by giving you separate outputs for each command, which makes debugging easier.

Note: you can also use this feature if you simply want to disable container reuse and ensure that each Exec node gets its own container.