Using Operon

The primary user interaction with Operon is through the command line interface, exposed as the executable operon. Subcommands are provided to install, manage, and run pipelines compatible with the Operon framework.

Initialization

Operon keeps track of pipelines and configurations, among other metadata, in a hidden directory. Before Operon can be used, this directory structure needs to be initialized:

$ operon init [init-path]

Where init-path is an optional path pointing to the location where the hidden .operon folder should be initialized. If this path isn’t given it will default to the user’s home directory.

By default, whenever Operon needs to access the .operon folder it will look in the user’s home directory. If the .operon folder has been initialized elsewhere, there must be a shell environment variable OPERON_HOME which points to the directory containing the .operon folder.

Note

If a path other than the user’s home directory is given to the init subprogram, it will attempt to add the OPERON_HOME environment variable to one of the user’s shell startup files: ~/.bashrc, ~/.bash_profile, or ~/.profile.

After a successful initialization, a new shell session should be started for tab-completion and the OPERON_HOME environment variable to take effect.
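The lookup rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Operon's actual code; the function name resolve_operon_dir and its environ parameter are hypothetical.

```python
import os
from pathlib import Path


def resolve_operon_dir(environ=None):
    """Return the expected location of the hidden .operon folder.

    Hypothetical helper: mirrors the documented rule that OPERON_HOME,
    when set, names the directory that contains .operon.
    """
    environ = os.environ if environ is None else environ
    base = environ.get("OPERON_HOME")
    # Fall back to the user's home directory when OPERON_HOME is unset or empty.
    root = Path(base).expanduser() if base else Path.home()
    return root / ".operon"
```

Calling resolve_operon_dir() with no arguments reads the real environment, which makes it a quick way to check which folder a new shell session would pick up.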

Pipeline Installation

To install an Operon-compatible pipeline into the Operon system:

$ operon install /path/to/pipeline.py [-y]

The pipeline file will be copied into the Operon system. Optionally, the Python package dependencies specified by the newly installed pipeline can then be installed into the current Python environment using pip.

Caution

If install attempts to install Python package dependencies, it will do so using pip's --upgrade flag. Packages that already exist in the current Python environment will therefore be either upgraded or downgraded, which may cause other software to stop functioning properly.
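Because of this, it can be useful to record which versions are currently installed before letting install touch the environment. The following is a minimal sketch using only the Python standard library; installed_versions is a hypothetical helper and is not part of Operon.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions
```

Comparing the result before and after an installation shows exactly which packages pip changed.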

Pipeline Configuration

To configure an Operon pipeline with platform-static values and optionally use Miniconda to install software executables that the pipeline uses:

$ operon configure <pipeline-name> [-h] [--location LOCATION] [--blank]

If this is the first time the pipeline has been configured and Miniconda is found in PATH, then the configure subprogram will attempt to create a new conda environment, install software instances that the pipeline uses, then inject those software paths into the next configuration step. If a conda environment for this pipeline has been created before, configure can attempt to inject those software paths instead.

For the configuration step, Operon will ask the user to provide values for the pipeline which will not change from run to run, such as software paths, paths to reference files, etc. Each question is followed by a value in brackets ([]), which is the value used if no input is provided. If a conda environment is used, this value in brackets will be the injected software path.
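The bracketed-default convention can be illustrated as follows. This is a sketch of the prompting pattern only, not Operon's implementation; the function ask, its read parameter, and the exact prompt layout are assumptions.

```python
def ask(question, default, read=input):
    """Prompt with the default shown in brackets.

    Hypothetical helper: an empty answer keeps the default, which is how
    an injected conda software path would be accepted unchanged.
    """
    answer = read(f"{question} [{default}]: ").strip()
    return answer if answer else default
```

Pressing Enter at such a prompt returns the bracketed value; typing anything else replaces it.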

By default, the configuration file is written into the .operon folder where it will automatically be called up when the user runs operon run. If --location is given as a path, the configuration file will be written out there instead.

Seeing Pipeline States

To see all pipelines in the Operon system and whether each has a corresponding configuration file:

$ operon list

To see detailed information about a particular pipeline, such as its current configuration, command line options, and any required dependencies:

$ operon show <pipeline-name>

Run a Pipeline

To run an installed pipeline:

$ operon run <pipeline-name> [--pipeline-config CONFIG] [--parsl-config CONFIG] \
                             [--logs-dir DIR] [pipeline-options]

The set of accepted pipeline-options is defined by the pipeline itself; these are meant to be values that change from run to run, such as input files, metadata, etc. Four options will always exist:

  • --pipeline-config can point to a pipeline config to use for this run only
  • --parsl-config can point to a Python file that represents a Parsl config to use for this run only (see below)
  • --logs-dir can point to a location where log files from this run should be deposited; if it doesn’t exist, it will be created; defaults to the current directory
  • --run-name gives a name to the run, which will be used in the log filename and helps differentiate this run from other runs
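When runs are launched from a script, the command line above can be assembled programmatically. The sketch below only builds the argument list from the flags documented in this section; build_run_command and its parameters are hypothetical, and the result would be handed to something like subprocess.run.

```python
def build_run_command(pipeline_name, pipeline_options=(), pipeline_config=None,
                      parsl_config=None, logs_dir=None, run_name=None):
    """Assemble an `operon run` argument list from optional settings."""
    command = ["operon", "run", pipeline_name]
    optional = [
        ("--pipeline-config", pipeline_config),
        ("--parsl-config", parsl_config),
        ("--logs-dir", logs_dir),
        ("--run-name", run_name),
    ]
    for flag, value in optional:
        # Only emit flags the caller actually set.
        if value is not None:
            command.extend([flag, str(value)])
    # Pipeline-defined options come last, as in the usage line above.
    command.extend(pipeline_options)
    return command
```

Keeping the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.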

When an Operon pipeline is run, under the hood it creates a Parsl workflow which can be executed in different ways depending on the accompanying Parsl configuration. This means that while the definition for a pipeline run with the run subprogram is consistent, the actual execution model may vary if the Parsl configuration varies.

Parsl Configuration

Parsl is the package that powers Operon and is responsible for Operon’s powerful and flexible parallel execution. Operon itself is only a front-end abstraction of a Parsl workflow; the actual execution model is fully Parsl-specific, so it’s advised to check out the Parsl documentation to get a sense of how to design a Parsl configuration for a specific use case.

The Parsl configuration must be specified in a Python file where the variable name config contains an object of type parsl.config.Config:

from parsl.config import Config

config = Config(
    executors=[...],
    lazy_errors=True,
    retries=10
)

The run subprogram attempts to pull a Parsl configuration from the user in the following order:

  1. A path from the command line argument --parsl-config
  2. A path from the pipeline configuration key parsl_config
  3. A package default Parsl configuration of 2 workers using Python threads
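The lookup order above, together with the requirement that the file expose a variable named config, can be sketched as below. This illustrates the documented behavior and is not Operon's source; pick_parsl_config_path and load_config_object are hypothetical names, and returning None to mean "use the package default" is an assumption.

```python
import importlib.util


def pick_parsl_config_path(cli_path=None, pipeline_config=None):
    """Return the Parsl config path to use, or None for the package default."""
    # 1. The --parsl-config command line argument wins.
    if cli_path:
        return cli_path
    # 2. Otherwise use the parsl_config key of the pipeline configuration.
    if pipeline_config and pipeline_config.get("parsl_config"):
        return pipeline_config["parsl_config"]
    # 3. Nothing supplied: the caller falls back to the default configuration.
    return None


def load_config_object(path):
    """Import a Python file by path and return its module-level `config`."""
    spec = importlib.util.spec_from_file_location("user_parsl_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.config
```

Loading the file by path means the Parsl configuration does not have to live inside an importable package.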

The Parsl configuration can contain multiple executors, each with a different model of execution and different available resources. If a multi-executor Parsl configuration is provided to Operon, it will try to match up the executor names as best as possible and execute software on appropriate sites. Any software for which no matching executor is found in the Parsl configuration will run in a random executor. The set of executor names the pipeline expects is output as a part of operon show.
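The matching rule can be pictured with the sketch below. It assumes exact name matching, which is a simplification: how Operon actually matches names "as best as possible" is not specified here. The function assign_executor and its parameters are hypothetical.

```python
import random


def assign_executor(wanted, available, rng=random):
    """Pick the executor label on which one piece of software should run.

    wanted:    the executor name the pipeline expects
    available: executor names present in the Parsl configuration
    """
    if wanted in available:
        return wanted
    # No match: fall back to a randomly chosen executor.
    return rng.choice(sorted(available))
```

Naming the executors in a Parsl configuration after the names reported by operon show avoids the random fallback.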

For more detailed information, refer to the Parsl documentation on the subject.

Run a Pipeline in Batch

A common use case is to run many samples or input units independently through the same pipeline. The batch-run subcommand supports this use case and gives the whole run a common pool of resources:

$ operon batch-run <pipeline-name> --input-matrix INPUT_MATRIX [--pipeline-config CONFIG] \
                                   [--parsl-config CONFIG] [--logs-dir DIR]

Operon treats a batch-run like a single large workflow which happens to contain many disjoint sub-workflows. Every node in the workflow graph is given equal access to a pool of resources so that those resources are used as efficiently as possible.

Input Matrix

Passing inputs into a batch-run isn’t done on the command line but rather is pre-gathered into a tab-separated matrix file of a specific format. The following formats are supported:

With Headers

The header line should be a tab-separated list of command line argument flags in the same format as one would use when typing directly on the command line. Optional arguments should use their verbatim flags, and positional arguments should use the form positional_i, where i is the position from left-most to right-most. Subsequent lines should have the same number of tab-separated items, where each item is the value for its corresponding header.

Singleton arguments (whose presence or absence denotes their value) can be specified in their affirmative form in the header line. The values given should be either true or false, which corresponds to whether the flag should be included or not.

--arg1  --inputs    --singleton positional_0    positional_1
val1    /path/to/input1 true    apples  blue
val3    /path/to/inputN true    strawberries    green
val2    /path/to/inputABB   false   kale    purple

Note

If the literal string "true" or "false" is needed, preface with a # as in #true.
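To make the header format concrete, the sketch below expands such a matrix into one argument list per row. It is an interpretation of the rules above rather than Operon's parser; matrix_to_argv is a hypothetical name, and treating only the exact values #true and #false as escapes is an assumption.

```python
import csv
import io


def matrix_to_argv(text):
    """Turn a headered, tab-separated matrix into one argv list per row."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    runs = []
    for row in body:
        flags, positionals = [], {}
        for name, value in zip(header, row):
            if name.startswith("positional_"):
                # positional_i: remember the value at position i.
                positionals[int(name.split("_", 1)[1])] = value
            elif value == "true":
                # Singleton flag is present for this row.
                flags.append(name)
            elif value == "false":
                # Singleton flag is omitted for this row.
                continue
            else:
                # "#true" / "#false" stand for the literal strings.
                if value in ("#true", "#false"):
                    value = value[1:]
                flags.extend([name, value])
        runs.append(flags + [positionals[i] for i in sorted(positionals)])
    return runs
```

Under these assumptions, the first data row of the example matrix expands to the same arguments as typing --arg1 val1 --inputs /path/to/input1 --singleton apples blue.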

Without Headers

If the flag --literal-input is given to batch-run, then the header line does not need to exist and each line is taken as a literal command line string which will be interpreted as if typed directly into the command line (starting with arguments to the pipeline).

--arg1 val1 --inputs /path/to/input1 --singleton apples blue
--arg1 val3 --inputs /path/to/inputN --singleton strawberries green
--arg1 val2 --inputs /path/to/inputABB kale purple
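Reading such a file amounts to splitting each line the way a shell would. The sketch below uses shlex from the Python standard library for that; whether Operon applies exactly these quoting rules is an assumption, and literal_lines_to_argv is a hypothetical name.

```python
import shlex


def literal_lines_to_argv(text):
    """Split each non-blank line into an argv list using shell-like rules."""
    return [shlex.split(line) for line in text.splitlines() if line.strip()]
```

With shell-like splitting, a value containing spaces would need to be quoted, just as it would on the command line.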

Command Line Help

All subcommands can be followed by -h, --help, or help to get a more detailed explanation of how they should be used.