- Snakemake now allows for data-dependent conditional re-evaluation of the job DAG via checkpoints. This feature also deprecates the ``dynamic`` flag. See `the docs <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution>`_.
[5.3.1] - 2018-12-06
====================
Changed
-------
- Fixed various bugs and papercuts, e.g., in group handling, Kubernetes execution, Singularity support, wrapper and script usage, benchmarking, and schema validation.
[5.3.0] - 2018-09-18
====================
...
...
@@ -26,8 +41,10 @@ Added
Changed
-------
- fixed permission issue when using script directive
- fixed various minor bugs and papercuts.
- Fixed permission issue when using the script directive. This is a breaking change
for scripts referring to files relative to the script directory (see the
In the following, you will find an **incomplete list** of publications making use of Snakemake for their analyses.
Please consider adding your own.
* Doris et al. 2018. `Spt6 is required for the fidelity of promoter selection <https://doi.org/10.1016/j.molcel.2018.09.005>`_. Molecular Cell.
* Karlsson et al. 2018. `Four evolutionary trajectories underlie genetic intratumoral variation in childhood cancer <https://www.nature.com/articles/s41588-018-0131-y>`_. Nature Genetics.
* Planchard et al. 2018. `The translational landscape of Arabidopsis mitochondria <https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gky489/5033161>`_. Nucleic Acids Research.
* Schult et al. 2018. `Effect of UV irradiation on Sulfolobus acidocaldarius and involvement of the general transcription factor TFB3 in the early UV response <https://academic.oup.com/nar/article/46/14/7179/5047281>`_. Nucleic Acids Research.
@@ -75,7 +75,7 @@ This entails the pipefail option, which reports errors from within a pipe to out
.. code-block:: bash
set +o pipefile;
set +o pipefail;
to your shell command in the problematic rule.
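For example, a rule along the following lines (rule, file, and command names are made up for illustration) would fail under pipefail whenever ``head`` closes the pipe early, but works once the prefix is added:

.. code-block:: python

    # sketch of a rule that disables pipefail for its shell command, so that
    # grep being terminated early by head does not fail the whole job
    rule extract_top_hits:
        input:
            "data/{sample}.txt"
        output:
            "tophits/{sample}.txt"
        shell:
            "set +o pipefail; "
            "grep 'hit' {input} | head -n 100 > {output}"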
...
...
@@ -605,3 +605,29 @@ You should have a look if maybe you are missing some library or a certain compil
If everything seems fine, please report to the upstream developers of the failing dependency.
Note that in general it is recommended to install Snakemake via `Conda <https://conda.io>`_ which gives you precompiled packages and the additional benefit of having :ref:`automatic software deployment <integrated_package_management>` integrated into your workflow execution.
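For example, with an existing Conda installation, one common way to install Snakemake is via the Bioconda and conda-forge channels:

.. code-block:: console

    $ conda install -c bioconda -c conda-forge snakemake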
How to enable autocompletion for the zsh shell?
-----------------------------------------------
For users of the `Z shell <https://www.zsh.org/>`_ (zsh), just run the following in an active zsh session to enable autocompletion for snakemake:
.. code-block:: console
compdef _gnu_generic snakemake
Example:
Say you have forgotten how to use the various options starting with ``force``: just type the partial match, i.e. ``--force``, and press tab, which results in a list of all potential hits along with a description of each:
.. code-block:: console
    $ snakemake --force<TAB>
--force -- Force the execution of the selected target or the
--force-use-threads -- Force threads rather than processes. Helpful if shared
--forceall -- Force the execution of the selected (or the first)
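To have the completion available in every new session, the same line can be appended to your ``~/.zshrc`` (assuming the completion system is already initialized there, e.g. via ``compinit``):

.. code-block:: console

    $ echo 'compdef _gnu_generic snakemake' >> ~/.zshrc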
@@ -111,11 +111,16 @@ This allows to create links between otherwise separate data analyses.
.. code-block:: python
subworkflow otherworkflow:
workdir: "../path/to/otherworkflow"
snakefile: "../path/to/otherworkflow/Snakefile"
workdir:
"../path/to/otherworkflow"
snakefile:
"../path/to/otherworkflow/Snakefile"
configfile:
"path/to/custom_configfile.yaml"
rule a:
input: otherworkflow("test.txt")
input:
otherworkflow("test.txt")
output: ...
shell: ...
...
...
@@ -123,6 +128,7 @@ Here, the subworkflow is named "otherworkflow" and it is located in the working
The snakefile is in the same directory and called ``Snakefile``.
If ``snakefile`` is not defined for the subworkflow, it is assumed to be located in the workdir location and called ``Snakefile``; hence, above we could have left the ``snakefile`` keyword out as well.
If ``workdir`` is not specified, it is assumed to be the same as the current one.
The (optional) definition of a ``configfile`` allows you to parameterize the subworkflow as needed.
Files that are output from the subworkflow that we depend on are marked with the ``otherworkflow`` function (see the input of rule a).
This function automatically determines the absolute path to the file (here ``../path/to/otherworkflow/test.txt``).
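The same mechanism works for any other file created by the subworkflow; as a sketch (the file names below are made up), a rule can also depend on several subworkflow outputs at once:

.. code-block:: python

    rule b:
        input:
            counts=otherworkflow("results/counts.tsv"),
            stats=otherworkflow("results/stats.txt")
        output:
            "report.txt"
        shell:
            "cat {input.counts} {input.stats} > {output}"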
@@ -426,6 +426,8 @@ Apart from Python scripts, this mechanism also allows you to integrate R_ and R
In the R script, an S4 object named ``snakemake``, analogous to the Python case above, is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with ``snakemake@input[[1]]`` (note that the first file does not have index ``0`` here, because R starts counting from ``1``). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. ``snakemake@input[["myfile"]]``.
For technical reasons, scripts are executed in ``.snakemake/scripts``. The original script directory is available as ``scriptdir`` in the ``snakemake`` object. A convenience method, ``snakemake@source()``, acts as a wrapper for the normal R ``source()`` function, and can be used to source files relative to the original script directory.
An example external Python script could look like this:
.. code-block:: python
...
...
@@ -530,8 +532,7 @@ Further, an output file marked as ``temp`` is deleted after all rules that use i
Directories as outputs
----------------------
There are situations where it can be convenient to have directories, rather than files, as outputs of a rule. For example, some tools generate different output files based on which settings they are run with. Rather than covering all these cases with conditional statements in the Snakemake rule, you can let the rule output a directory that contains all the output files regardless of settings. Another use case could be when the number of outputs is large or unknown, say one file per identified species in a metagenomics sample or one file per cluster from a clustering algorithm. If all downstream rules rely on the whole sets of outputs, rather than on the individual species/clusters, then having a directory as an output can be a faster and easier solution compared to using the ``dynamic`` keyword.
As of version 5.2.0, directories as outputs have to be explicitly marked with ``directory``. This is primarily for safety reasons; since all outputs are deleted before a job is executed, we don't want to risk deleting important directories if the user makes some mistake. Marking the output as ``directory`` makes the intent clear, and the output can be safely removed. Another reason comes down to how modification time for directories work. The modification time on a directory changes when a file or a subdirectory is added, removed or renamed. This can easily happen in not-quite-intended ways, such as when Apple macOS or MS Windows add ``.DS_Store`` or ``thumbs.db`` files to store parameters for how the directory contents should be displayed. When the ``directory`` flag is used, then a hidden file called ``.snakemake_timestamp`` is created in the output directory, and the modification time of that file is used when determining whether the rule output is up to date or if it needs to be rerun.
Sometimes it can be convenient to have directories, rather than files, as outputs of a rule. As of version 5.2.0, directories as outputs have to be explicitly marked with ``directory``. This is primarily for safety reasons; since all outputs are deleted before a job is executed, we don't want to risk deleting important directories if the user makes some mistake. Marking the output as ``directory`` makes the intent clear, and the output can be safely removed. Another reason comes down to how modification time for directories works. The modification time on a directory changes when a file or a subdirectory is added, removed, or renamed. This can easily happen in not-quite-intended ways, such as when Apple macOS or MS Windows add ``.DS_Store`` or ``thumbs.db`` files to store parameters for how the directory contents should be displayed. When the ``directory`` flag is used, a hidden file called ``.snakemake_timestamp`` is created in the output directory, and its modification time is used when determining whether the rule output is up to date or whether it needs to be rerun. Before resorting to ``directory()``, always consider whether you can formulate your workflow using normal files instead.
.. code-block:: python
...
...
@@ -1038,3 +1039,174 @@ Naturally, a pipe output may only have a single consumer.
It is possible to combine explicit group definition as above with pipe outputs.
Thereby, pipe jobs can live within, or (automatically) extend, existing groups.
However, the two jobs connected by a pipe may not exist in conflicting groups.
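As a sketch of such a combination (rule names, group name, and commands are made up), the producing and the consuming rule can simply be assigned to the same group, with the connecting file marked as a pipe:

.. code-block:: python

    rule produce_stream:
        input:
            "raw/{sample}.txt"
        output:
            pipe("stream/{sample}.txt")
        group: "streamed"
        shell:
            "cat {input} > {output}"

    rule consume_stream:
        input:
            "stream/{sample}.txt"
        output:
            "final/{sample}.txt.gz"
        group: "streamed"
        shell:
            "gzip -c {input} > {output}"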
.. _snakefiles-checkpoints:
Data-dependent conditional execution
------------------------------------
From Snakemake 5.4 on, conditional re-evaluation of the DAG of jobs based on the contents of output files is possible.
The key idea is that rules can be declared as checkpoints, e.g.,
.. code-block:: python
checkpoint somestep:
input:
"samples/{sample}.txt"
output:
"somestep/{sample}.txt"
shell:
"somecommand {input} > {output}"
Snakemake allows the DAG to be re-evaluated after the successful execution of every job spawned from a checkpoint.
For this, every checkpoint is registered by its name in a globally available ``checkpoints`` object.
The ``checkpoints`` object can be accessed by :ref:`input functions <snakefiles-input_functions>`.
Assuming that the checkpoint is named ``somestep`` as above, the output files for a particular job can be retrieved with
.. code-block:: python
checkpoints.somestep.get(sample="a").output
If the checkpoint has not yet been executed for these particular wildcard value(s), the ``get`` method throws a ``snakemake.exceptions.IncompleteCheckpointException``.
Inside an input function, this exception is automatically handled by Snakemake and leads to a re-evaluation of the DAG once the checkpoint has successfully finished.
To illustrate the possibilities of this mechanism, consider the following complete example:
.. code-block:: python
# a target rule to define the desired final output
rule all:
input:
"aggregated/a.txt",
"aggregated/b.txt"
    # the checkpoint that shall trigger re-evaluation of the DAG
    checkpoint somestep:
        input:
            "samples/{sample}.txt"
        output:
            "somestep/{sample}.txt"
        shell:
            "somecommand {input} > {output}"

    # input function for the rule aggregate, deciding based on the
    # contents of the checkpoint output which file to request
    def aggregate_input(wildcards):
        with open(checkpoints.somestep.get(sample=wildcards.sample).output[0]) as f:
            if f.read().strip() == "a":
                return "post/{sample}.txt"
            else:
                return "alt/{sample}.txt"
rule aggregate:
input:
aggregate_input
output:
"aggregated/{sample}.txt"
shell:
"touch {output}"
As can be seen, the rule ``aggregate`` uses an input function.
Inside the function, we first retrieve the output files of the checkpoint ``somestep`` with the given wildcards, passing through the value of the wildcard ``sample``.
Upon execution, if the checkpoint is not yet complete, Snakemake will record ``somestep`` as a direct dependency of the rule ``aggregate``.
Once ``somestep`` has finished for a given sample, the input function will automatically be re-evaluated and the ``get`` method will no longer raise an exception.
Instead, the output file will be opened, and depending on its contents either ``"post/{sample}.txt"`` or ``"alt/{sample}.txt"`` will be returned by the input function.
This way, the DAG becomes conditional on some produced data.
It is also possible to use checkpoints for cases where the output files are unknown before execution.
A typical example is a clustering process with an unknown number of clusters, where each cluster shall be saved into a separate file.
Consider the following example:
.. code-block:: python
# a target rule to define the desired final output
rule all:
input:
"aggregated/a.txt",
"aggregated/b.txt"
# the checkpoint that shall trigger re-evaluation of the DAG
checkpoint clustering:
input:
"samples/{sample}.txt"
output:
clusters=directory("clustering/{sample}")
shell:
"mkdir clustering/{wildcards.sample}; "
"for i in 1 2 3; do echo $i > clustering/{wildcards.sample}/$i.txt; done"
    # an intermediate rule, executed for each produced cluster
    rule intermediate:
        input:
            "clustering/{sample}/{i}.txt"
        output:
            "post/{sample}/{i}.txt"
        shell:
            "cp {input} {output}"

    # input function for the aggregation rule, collecting all produced clusters
    def aggregate_input(wildcards):
        checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
        return expand("post/{sample}/{i}.txt",
                      sample=wildcards.sample,
                      i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)

    # an aggregation over all produced clusters
    rule aggregate:
        input:
            aggregate_input
        output:
            "aggregated/{sample}.txt"
        shell:
            "touch {output}"

In this example, the input function of the rule ``aggregate`` first obtains the output directory of the checkpoint by calling ``checkpoints.clustering.get(**wildcards)``, which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking).
If the checkpoint has not yet been executed, accessing ``checkpoints.clustering.get(**wildcards)`` ensures that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
These values are then used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
This mechanism can be used to replace the :ref:`dynamic-flag <snakefiles-dynamic_files>`, which will be deprecated in Snakemake 6.0.
@@ -66,7 +66,7 @@ The ``benchmark`` directive takes a string that points to the file where benchma
Similar to output files, the path can contain wildcards (they must be the same wildcards as in the output files).
When a job derived from the rule is executed, Snakemake will measure the wall clock time and memory usage (in MiB) and store it in the file in tab-delimited format.
It is possible to repeat a benchmark multiple times in order to get a sense for the variability of the measurements.
This can be done by annotating the benchmark file, e.g., with ``benchmark("benchmarks/{sample}.bwa.benchmark.txt", 3)`` Snakemake can be told to run the job three times.
This can be done by annotating the benchmark file: e.g., with ``repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)``, Snakemake is told to run the job three times.
The repeated measurements occur as subsequent lines in the tab-delimited benchmark file.
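As a sketch (the mapping command and file names are just examples), a rule with such a repeated benchmark could look like this:

.. code-block:: python

    rule bwa_map:
        input:
            "data/genome.fa",
            "data/samples/{sample}.fastq"
        output:
            "mapped_reads/{sample}.bam"
        # the benchmark below is run three times; each repetition is recorded
        # as a separate line of wall clock time and memory measurements
        benchmark:
            repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)
        shell:
            "bwa mem {input} | samtools view -Sb - > {output}"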