In certain data analysis fields, some intermediate results recur in exactly the same way across many analyses.
For example, in bioinformatics, reference genomes or annotations are downloaded, and read mapping indexes are built.
...
...
The environment variable that is defined in the first line determines the location of the cache.
When Snakemake is executed without a shared filesystem (e.g., in the cloud, see :ref:`cloud`), the environment variable has to point to a location compatible with the given remote provider (e.g. an S3 or Google Storage bucket).
In any case, the provided location should be shared between all workflows of your group, institute or computing environment, in order to benefit from the reuse of previously obtained intermediate results.
Note that only rules with just a single output file (or directory) or with :ref:`multiext output files <snakefiles-multiext>` are eligible for caching.
The reason is that for other rules it would be impossible to unambiguously assign the output files to cache entries while being agnostic of the actual file names.
Also note that the rules need to retrieve all their parameters via the ``params`` directive (except input files).
It is not allowed to directly use ``wildcards``, ``config`` or any global variable in the shell command or script, because these are not captured in the hash (otherwise, reuse would be unnecessarily limited).
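As an illustration of these constraints, consider the following sketch (the rule name, URL, and config keys are invented for this example): the rule has a single output file and receives its parameters exclusively via ``params``, so that their values are captured in the hash.

.. code-block:: python

    rule download_annotation:
        output:
            "resources/annotation.gtf"
        # config values are passed via params so that they are captured in
        # the hash identifying the cache entry
        params:
            species=config["species"],
            release=config["release"]
        shell:
            "curl -L https://example.org/{params.species}/{params.release}/annotation.gtf "
            "> {output}"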
Rules describe how to create **output files** from **input files**.
* Input and output files can contain multiple named wildcards.
* Rules can either use shell commands, plain Python code or external Python or R scripts to create output files from input files.
* Snakemake workflows can be easily executed on **workstations**, **clusters**, **the grid**, and **in the cloud** without modification. The job scheduling can be constrained by arbitrary resources like e.g. available CPU cores, memory or GPUs.
* Snakemake can automatically deploy required software dependencies of a workflow using `Conda <https://conda.io>`_ or `Singularity <https://sylabs.io/docs/>`_.
* Snakemake can use Amazon S3, Google Storage, Dropbox, FTP, WebDAV, SFTP and iRODS to access input or output files and further access input files via HTTP and HTTPS.
Then, a workflow can be deployed to a new system via the following steps
...
...
cd path/to/workdir
# edit config and workflow as needed
vim config/config.yaml
# execute workflow, deploy software dependencies via conda
snakemake -n --use-conda
...
...
The path to the environment definition is interpreted as **relative to the Snakefile that contains the rule**.
Snakemake will store the environment persistently in ``.snakemake/conda/$hash`` with ``$hash`` being the MD5 hash of the environment definition file content. This way, updates to the environment definition are automatically detected.
Note that you need to clean up environments manually for now. However, in many cases they are lightweight and consist of symlinks to your central conda installation.
Conda deployment also works well for offline or air-gapped environments. Running ``snakemake -n --use-conda --create-envs-only`` will only install the required conda environments without running the full workflow. Subsequent runs with ``--use-conda`` will make use of the local environments without requiring internet access.
.. _singularity:
...
...
The user can, upon execution, freely choose the desired level of reproducibility:
* Conda based package management (use versions defined by the workflow developer)
* Conda based package management in containerized OS (use versions and OS defined by the workflow developer)
-------------------------
Using environment modules
-------------------------
In high performance computing (HPC) cluster systems, it can be preferable to use environment modules for the deployment of optimized versions of certain standard tools.
Snakemake allows environment modules to be defined per rule:
.. code-block:: python

    rule bwa:
        input:
            "genome.fa",
            "reads.fq"
        output:
            "mapped.bam"
        conda:
            "envs/bwa.yaml"
        envmodules:
            "bio/bwa/0.7.9",
            "bio/samtools/1.9"
        shell:
            "bwa mem {input} | samtools view -Sbh - > {output}"
Here, when Snakemake is executed with ``snakemake --use-envmodules``, it will load the defined modules in the given order, instead of using the conda environment that is also defined.
Note that although not mandatory, one should always provide either a conda environment or a container (see above), along with environment module definitions.
The reason is that environment modules are often highly platform specific, and cannot be assumed to be available somewhere else, thereby limiting reproducibility.
By defining an equivalent conda environment or container as a fallback, people outside of the HPC system where the workflow has been designed can still execute it, e.g. by running ``snakemake --use-conda`` instead of ``snakemake --use-envmodules``.
Finally, you can also define global wildcard constraints that apply for all rules:
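For example, the following sketch (the wildcard name and pattern are illustrative) constrains the ``dataset`` wildcard to digits in every rule of the workflow:

.. code-block:: python

    # global constraint: the dataset wildcard may only match digits
    wildcard_constraints:
        dataset=r"\d+"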
See the `Python documentation on regular expressions <https://docs.python.org/3/library/re.html>`_ for detailed information on regular expression syntax.
Aggregation
-----------
Input files can be Python lists, making it easy to aggregate over parameters or samples:
.. code-block:: python

    rule aggregate:
        input:
            ["{dataset}/a.txt".format(dataset=dataset) for dataset in DATASETS]
        output:
            "aggregated.txt"
        shell:
            ...
This may be used for "aggregation" rules for which files from multiple or all datasets are needed to produce a specific output (say, *allSamplesSummary.pdf*).
The above expression can be simplified in two ways.

The expand function
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    rule aggregate:
        input:
            expand("{dataset}/a.txt", dataset=DATASETS)
        output:
            "aggregated.txt"
        shell:
            ...
Note that *dataset* is NOT a wildcard here because it is resolved by Snakemake due to the ``expand`` statement.
The ``expand`` function also allows combining different variables.
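For example (a sketch; ``DATASETS`` is the list defined before, and ``FORMATS`` is a list of file extensions introduced just below):

.. code-block:: python

    # every dataset is combined with every extension
    expand("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS)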
If now ``FORMATS=["txt", "csv"]`` contains a list of desired output formats then expand will automatically combine any dataset with any of these extensions.
Further, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list, as in the sketch below.
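This sketch reuses the ``DATASETS`` and ``FORMATS`` lists from above; only the list of patterns is new:

.. code-block:: python

    # the expansion is applied to both patterns in the list, yielding the
    # a and b variants for every dataset/extension combination
    expand(["{dataset}/a.{ext}", "{dataset}/b.{ext}"], dataset=DATASETS, ext=FORMATS)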
By default, ``expand`` uses the Python itertools function ``product``, which yields all combinations of the provided wildcard values. However, by inserting a second positional argument, this can be replaced by any combinatoric function, e.g. ``zip``:
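For example (a sketch using the same lists):

.. code-block:: python

    # zip pairs the i-th dataset with the i-th extension instead of combining
    # every dataset with every extension; with DATASETS=["ds1", "ds2"] and
    # FORMATS=["txt", "csv"] this yields ["ds1/a.txt", "ds2/a.csv"]
    expand("{dataset}/a.{ext}", zip, dataset=DATASETS, ext=FORMATS)

A wildcard can also be kept out of the expansion by masking it with doubled braces, so that an expression like

.. code-block:: python

    # {{dataset}} is escaped and therefore remains a wildcard
    expand("{{dataset}}/a.{ext}", ext=FORMATS)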
will create strings with all values for ext but starting with the wildcard ``"{dataset}"``.
.. _snakefiles-multiext:
The multiext function
~~~~~~~~~~~~~~~~~~~~~
``multiext`` provides a simplified variant of ``expand`` that allows defining a set of output or input files that just differ by their extension:
.. code-block:: python

    rule plot:
        input:
            ...
        output:
            multiext("some/plot", ".pdf", ".svg", ".png")
        shell:
            ...
The effect is the same as if you wrote ``expand("some/plot{ext}", ext=[".pdf", ".svg", ".png"])``, but with a simpler syntax.
Moreover, defining output with ``multiext`` is the only way to use :ref:`between workflow caching <caching>` for rules with multiple output files.
.. _snakefiles-targets:
Targets and aggregation
-----------------------
By default snakemake executes the first rule in the snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build-targets similar to GNU Make:
.. code-block:: python

    rule all:
        input: ["{dataset}/file.A.txt".format(dataset=dataset) for dataset in DATASETS]
Here, for each dataset in a python list ``DATASETS`` defined before, the file ``{dataset}/file.A.txt`` is requested. In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule ``complex_conversion`` shown above.
.. _snakefiles-threads:
...
...
Further, a rule can be given a number of threads to use, as in the sketch below.
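A minimal sketch (the rule, the file names, and the ``sort --parallel`` call are illustrative):

.. code-block:: python

    rule sort:
        input:
            "path/to/dataset.txt"
        output:
            "dataset.sorted.txt"
        # upper bound; Snakemake may scale this down to the number of
        # cores given via --cores
        threads: 4
        shell:
            "sort --parallel={threads} {input} > {output}"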
On a cluster node, Snakemake uses as many cores as available on that node.
Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node.
Note: This behavior is not affected by ``--local-cores``, which only applies to jobs running on the master node.
Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built-in variable ``threads`` rather than hardcoding it into the shell command.
In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. ``threads = min(threads, cores)`` with ``cores`` being the number of cores specified at the command line (option ``--cores``).
Hardcoding a particular maximum number of threads like above is useful when a certain tool has a natural maximum beyond which parallelization won't help to further speed it up.
This is often the case, and should be evaluated carefully for production workflows.
If it is certain that no such maximum exists for a tool, one can instead define threads as a function of the number of cores given to Snakemake:
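A sketch of such a definition (the rule, the file names, and the placeholder command are illustrative):

.. code-block:: python

    rule b:
        input:
            "data/input.txt"
        output:
            "results/output.txt"
        # use 75% of the cores given via --cores (rounded down, minimum 1)
        threads: workflow.cores * 0.75
        # 'somecommand' stands in for any real multi-threaded tool
        shell:
            "somecommand --threads {threads} {input} > {output}"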
The number of given cores is globally available in the Snakefile as an attribute of the workflow object: ``workflow.cores``.
Any arithmetic operation can be performed to derive a number of threads from this. E.g., in the above example, we reserve 75% of the given cores for the rule.
Snakemake will always round the calculated value down (while enforcing a minimum of 1 thread).
Starting from version 3.7, threads can also be a callable that returns an ``int`` value. The signature of the callable should be ``callable(wildcards[, input])`` (``input`` is an optional parameter). It is also possible to refer to a predefined variable (e.g., ``threads: threads_max``), so that the number of cores for a set of rules can be changed in a single place, by altering the value of the variable ``threads_max``.
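As a sketch combining both options (the function, variable, and command names are illustrative):

.. code-block:: python

    # one place to adjust the thread budget for several rules
    threads_max = 8

    def get_threads(wildcards, input):
        # scale with the number of input files, capped at threads_max
        return min(len(input), threads_max)

    rule c:
        input:
            "a.txt",
            "b.txt"
        output:
            "c.txt"
        threads: get_threads
        shell:
            "somecommand --threads {threads} {input} > {output}"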