- Directory outputs have to be marked with `directory`. This ensures proper handling of timestamps and cleanup. This is a breaking change (see the sketch after this list). Implemented by Rasmus Ågren.
- Fixed kubernetes tests, fixed kubernetes volume handling. Implemented by Andrew Schriefer.
- jinja2 and networkx are not optional dependencies when installing via pip.
- When conda or singularity directives are used and the corresponding CLI flags are not specified, the user is notified at the beginning of the log output.
- Fixed numerous small bugs and papercuts and extended documentation.
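A minimal sketch of the new `directory` marker (the rule name, paths, and shell command are hypothetical illustrations, not from the release itself):

```python
rule collect_plots:
    output:
        # mark the output as a directory so that Snakemake handles
        # timestamps and cleanup for its contents
        directory("results/plots")
    shell:
        "mkdir -p {output} && cp plots/*.svg {output}"
```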
## [5.1.5] - 2018-06-24
### Changed
- Fixed missing version info in Docker image.
- Several minor fixes to EGA support.
## [5.1.4] - 2018-05-28
### Added
- Allow `category` to be set.
### Changed
- Various cosmetic changes to reports.
- Fixed encoding issues in reports.
## [5.1.3] - 2018-05-22
### Changed
- Fixed various bugs in job groups, shadow directive, singularity directive, and more.
## [5.1.2] - 2018-05-18
### Changed
- Fixed a bug in the report stylesheet.
## [5.1.0] - 2018-05-17
### Added
- A new framework for self-contained HTML reports, including results, statistics and topology information. In future releases this will be further extended.
- A new utility `snakemake.utils.validate()` which allows validating config and pandas data frames using JSON schemas.
- Two new flags --cleanup-shadow and --cleanup-conda to clean up old unused conda and shadow data.
### Changed
- Benchmark repeats are now specified inside the workflow via the new `repeat()` function.
- Command line interface help has been refactored into groups for better readability.
## [5.0.0] - 2018-05-11
### Added
- Group jobs for reduced queuing and network overhead, in particular with short running jobs.
- Output files can be marked as pipes, such that producing and consuming jobs are executed simultaneously and information is transferred directly without using disk (see the sketch after this list).
- Command line flags to clean output files.
- Command line flag to list files in working directory that are not tracked by Snakemake.
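A hedged sketch of piped output between two rules (rule names and commands are hypothetical):

```python
rule produce:
    output:
        # a pipe: the consuming job is scheduled simultaneously and
        # the data never touches the disk
        pipe("stream/{sample}.txt")
    shell:
        "seq 1000000 > {output}"

rule consume:
    input:
        "stream/{sample}.txt"
    output:
        "results/{sample}.count"
    shell:
        "wc -l < {input} > {output}"
```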
### Changed
- Fixed --default-remote-prefix in case of input functions returning lists or dicts.
- Scheduler no longer prefers jobs with many downstream jobs.
## [4.8.1] - 2018-04-25
### Added
- Allow URLs for the conda directive.
### Changed
- Various minor updates in the docs.
- Several bug fixes with remote file handling.
- Fixed ImportError occurring with the script directive.
- Use latest Singularity.
- Improved caching for file existence checks. We first check the existence of parent directories and cache these results. This way, large parts of the generated FS tree can be pruned if files are not yet present. If files are present, the overhead is minimal, since the checks for the parents are cached.
- Various minor bug fixes.
## [4.8.0] - 2018-03-13
### Added
- Integration with CWL: the `cwl` directive allows using CWL tool definitions in addition to shell commands or Snakemake wrappers.
- A global `singularity` directive allows defining a global singularity container to be used for all rules that don't specify their own.
- Singularity and Conda can now be combined. This can be used to specify the operating system (via singularity), and the software stack (via conda), without the overhead of creating specialized container images for workflows or tasks.
## [4.7.0] - 2018-02-19
### Changed
- Speedups when calculating dry-runs.
- Speedups for workflows with many rules when calculating the DAG.
- Accept SIGTERM to gracefully finish all running jobs and exit.
- Various minor bug fixes.
## [4.6.0] - 2018-02-06
### Changed
- Log files can now be used as input files for other rules.
- Adapted to changes in Kubernetes client API.
- Fixed minor issues in --archive option.
- Search path order in scripts was changed to fix a bug with leaked packages from the root env when using the script directive together with conda.
## [4.5.1] - 2018-02-01
### Added
- Input and output files can now be pathlib objects.
### Changed
- Various minor bug fixes.
## [4.5.0] - 2018-01-18
### Added
- iRODS remote provider
### Changed
- Bug fix in shell usage of scripts and wrappers.
- Bug fixes for cluster execution, --immediate-submit and subworkflows.
## [4.4.0] - 2017-12-21
### Added
- A new shadow mode (minimal) that only symlinks input files has been added.
### Changed
- The default shell is now bash on Linux and macOS. If bash is not installed, we fall back to sh. Previously, Snakemake used the default shell of the user, which defeats the purpose of portability. If the developer decides so, the shell can always be overridden using shell.executable() (see the sketch after this list).
- Snakemake now requires Singularity 2.4.1 at least (only when running with --use-singularity).
- HTTP remote provider no longer automatically unpacks gzipped files.
- Fixed various smaller bugs.
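For example, a workflow that insists on a particular shell could state this at the top of its Snakefile (a sketch; the path is an assumption about the target system):

```python
from snakemake.shell import shell

# override the default shell for all rules of this workflow
shell.executable("/bin/bash")
```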
## [4.3.1] - 2017-11-16
### Added
- List all conda environments with their location on disk via --list-conda-envs.
### Changed
- Do not clean up shadow on dry-run.
- Allow R wrappers.
## [4.3.0] - 2017-10-27
### Added
- GridFTP remote provider. This is a specialization of the GFAL remote provider that uses globus-url-copy to download or upload files.
### Changed
- Scheduling and execution mechanisms have undergone a major revision that removes several potential (but rare) deadlocks.
- Several bugs and corner cases of the singularity support have been fixed.
- Snakemake now requires at least Singularity 2.4.
## [4.2.0] - 2017-10-10
### Added
- Support for executing jobs in per-rule singularity images. This is meant as an alternative to the conda directive (see docs), providing even more guarantees for reproducibility.
### Changed
- In cluster mode, jobs that are still running after Snakemake has been killed are automatically resumed.
- Various fixes to GFAL remote provider.
- Fixed --summary and --list-code-changes.
- Many other small bug fixes.
## [4.1.0] - 2017-09-26
### Added
- Support for configuration profiles. Profiles allow specifying default options, e.g., a cluster
  submission command. They can be used via 'snakemake --profile myprofile'. See the docs for details.
- GFAL remote provider. This allows using GridFTP, SRM and any other protocol supported by GFAL for remote input and output files.
- Added the --cluster-status flag, which allows specifying a command that returns the status of cluster jobs.
### Changed
- The scheduler now tries to get rid of the largest temp files first.
- The Docker image used for kubernetes support can now be configured at the command line.
- Rate-limiting for cluster interaction has been unified.
- S3 remote provider uses boto3.
- Resource functions can now use an additional `attempt` parameter that contains the number of times this job has already been tried (see the sketch after this list).
- Various minor fixes.
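A sketch of how the `attempt` parameter might be used together with `--restart-times` (the rule and its command are hypothetical):

```python
rule align:
    input:
        "reads/{sample}.fastq"
    output:
        "aligned/{sample}.bam"
    resources:
        # attempt is 1 on the first try and increases with every restart,
        # so each retry requests more memory
        mem_mb=lambda wildcards, attempt: 4000 * attempt
    shell:
        "aligner {input} > {output}"
```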
## [4.0.0] - 2017-07-24
### Added
- Cloud computing support via Kubernetes. Snakemake workflows can be executed transparently
  in the cloud, while storing input and output files within the cloud storage
  (e.g. S3 or Google Storage). I.e., this feature does not need a shared filesystem
  between the cloud nodes, and thereby makes the setup really simple.
- WebDAV remote file support: Snakemake can now read and write from WebDAV. Hence,
it can now, e.g., interact with Nextcloud or Owncloud.
- Support for default remote providers: define a remote provider to implicitly
use for all input and output files.
- Added an option to only create conda environments instead of executing the workflow.
### Changed
- The number of files used for the metadata tracking of Snakemake (e.g., code, params, input changes) in the .snakemake directory has been reduced by a factor of 10, which should help with NFS and IO bottlenecks. This is a breaking change in the sense that Snakemake 4.x won't see the metadata of workflows executed with Snakemake 3.x. However, old metadata won't be overwritten, so that you can always go back and check things by installing an older version of Snakemake again.
- The google storage (GS) remote provider has been changed to use the google SDK.
This is a breaking change, since the remote provider invocation has been simplified (see docs).
- Due to WebDAV support (which uses asyncio), Snakemake now requires Python 3.5 at least.
- Various minor bug fixes (e.g. for dynamic output files).
## [3.13.3] - 2017-06-23
### Changed
- Fixed a follow-up bug in Namedlist where a single item was not returned as a string.
## [3.13.2] - 2017-06-20
### Changed
- The --wrapper-prefix flag now also affects where the corresponding environment definition is fetched from.
- Fix bug where empty output file list was recognized as containing duplicates (issue #574).
## [3.13.1] - 2017-06-20
### Changed
- Fix --conda-prefix to be passed to all jobs.
- Fix cleanup issue with scripts that fail to download.
## [3.13.0] - 2017-06-12
### Added
- An NCBI remote provider. With this, you can seamlessly integrate any NCBI resource (reference genome, gene/protein sequences, ...) as an input file.
### Changed
- Snakemake now detects if automatically generated conda environments have to be recreated because the workflow has been moved to a new path.
- Remote functionality has been made more robust, in particular to avoid race conditions.
- `--config` parameter evaluation has been fixed for non-string types.
- The Snakemake docker container is now based on the official debian image.
## [3.12.0] - 2017-05-09
### Added
- Support for RMarkdown (.Rmd) in script directives.
- New option --debug-dag that prints all decisions while building the DAG of jobs. This helps to debug problems like cycles or unexpected MissingInputExceptions.
- New option --conda-prefix to specify the place where conda environments are stored.
### Changed
- Benchmark files now also include the maximal RSS and VMS size of the Snakemake process and all sub processes.
- Sped up conda environment creation.
- Allow specification of DRMAA log dir.
- Pass cluster config to subworkflow.
## [3.11.2] - 2017-03-15
### Changed
- Fixed the fix for handling of local URIs with the wrapper directive.
## [3.11.1] - 2017-03-14
### Changed
- `--touch` ignores missing files.
- Fixed handling of local URIs with the wrapper directive.
## [3.11.0] - 2017-03-08
### Added
- Param functions can now also refer to threads.
### Changed
- Improved tutorial and docs.
- Made conda integration more robust.
- None is converted to NULL in R scripts.
## [3.10.2] - 2017-02-28
### Changed
- Improved config file handling and merging.
- Output files can be referred to in params functions (i.e. `lambda wildcards, output: ...`).
- Improved conda-environment creation.
- Jobs are cached, leading to reduced memory footprint.
- Fixed subworkflow handling in input functions.
## [3.10.0] - 2017-01-18
### Added
- Workflows can now be archived to a tarball with `snakemake --archive my-workflow.tar.gz`. The archive contains all input files, source code versioned with git and all software packages that are defined via conda environments. Hence, the archive allows fully reproducing a workflow on a different machine. Such an archive can be uploaded to Zenodo, so that your workflow is preserved in a self-contained, executable way for the future.
### Changed
- Improved logging.
- Reduced memory footprint.
- Added a flag to automatically unpack the output of input functions.
- Improved handling of HTTP redirects with remote files.
- Improved exception handling with DRMAA.
- Scripts referred by the script directive can now use locally defined external python modules.
## [3.9.1] - 2016-12-23
### Added
- Jobs can be restarted upon failure (--restart-times).
### Changed
- The docs have been restructured and improved. Now available under snakemake.readthedocs.org.
- Changes in scripts show up with --list-code-changes.
- Duplicate output files now cause an error.
- Various bug fixes.
## [3.9.0] - 2016-11-15
### Added
- Ability to define isolated conda software environments (YAML) per rule. Environments will be deployed by Snakemake upon workflow execution.
- Command line argument --wrapper-prefix in order to override the default URL for looking up wrapper scripts.
### Changed
- --summary now displays the log files corresponding to each output file.
- Fixed hangups when using the run directive with a large number of jobs.
- Fixed pickling errors with anonymous rules and the run directive.
- Various small bug fixes.
## [3.8.2] - 2016-09-23
### Changed
- Add missing import in rules.py.
- Use threading only in cluster jobs.
## [3.8.1] - 2016-09-14
### Changed
- Snakemake now warns when using relative paths starting with "./".
- The option -R now also accepts an empty list of arguments.
- Bug fix when handling benchmark directive.
- Jobscripts exit with code 1 in case of failure. This should improve the error messages of the cluster system.
- Fixed a bug in SFTP remote provider.
## [3.8.0] - 2016-08-26
### Added
- Wildcards can now be constrained by rule and globally via the new `wildcard_constraints` directive (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wildcards)); a sketch follows after this list.
- Subworkflows now allow overwriting their config file via the configfile directive in the calling Snakefile.
- A method `log_fmt_shell` in the snakemake proxy object that is available in scripts and wrappers allows obtaining a formatted string to redirect logging output from STDOUT or STDERR.
- Functions given to resources can now optionally contain an additional argument `input` that refers to the input files.
- Functions given to params can now optionally contain additional arguments `input` (see above) and `resources`. The latter refers to the resources.
- It is now possible to let items in shell commands be automatically quoted (see the [docs](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-rules)). This is useful when dealing with filenames that contain whitespace.
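A sketch combining the first and fifth points above (the rule, paths, and regular expression are hypothetical):

```python
wildcard_constraints:
    # global constraint: restrict {sample} to alphanumeric names
    sample="[A-Za-z0-9]+"

rule summarize:
    input:
        "data/{sample}.txt"
    output:
        "summaries/{sample}.txt"
    params:
        # params functions may take the additional arguments
        # `input` and `resources`
        n_inputs=lambda wildcards, input: len(input)
    shell:
        "echo {params.n_inputs} {wildcards.sample} > {output}"
```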
### Changed
- Snakemake now deletes output files before job execution. Further, it touches output files after job execution. This solves various problems with slow NFS filesystems.
- A bug was fixed that caused dynamic output rules to be executed multiple times when forcing their execution with -R.
- A bug causing double uploads with remote files was fixed. Various additional bug fixes related to remote files.
- Various minor bug fixes.
## [3.7.1] - 2016-05-16
### Changed
- Fixed a missing import of the multiprocessing module.
## [3.7.0] - 2016-05-05
### Added
- The entries in `resources` and the `threads` job attribute can now be callables that must return `int` values (see the sketch after this list).
- Multiple `--cluster-config` arguments can be given to the Snakemake command line. Later ones override earlier ones.
- In the API, multiple `cluster_config` paths can be given as a list, alternatively to the previous behaviour of expecting one string for this parameter.
- When submitting cluster jobs (either through `--cluster` or `--drmaa`), you can now use `--max-jobs-per-second` to limit the number of jobs being submitted (also available through Snakemake API). Some cluster installations have problems with too many jobs per second.
- Wildcard values are now printed upon job execution in addition to input and output files.
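A sketch of a callable `resources` entry (the heuristic and names are hypothetical):

```python
def mem_for_sample(wildcards):
    # hypothetical heuristic: known large samples get more memory
    return 16000 if wildcards.sample in {"big1", "big2"} else 2000

rule process:
    input:
        "data/{sample}.txt"
    output:
        "processed/{sample}.txt"
    resources:
        # resources callables must return int values
        mem_mb=mem_for_sample
    shell:
        "process-tool {input} > {output}"
```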
### Changed
- Fixed a bug with HTTP remote providers.
## [3.6.1] - 2016-04-08
### Changed
- Work around missing RecursionError in Python < 3.5
- Improved conversion of numpy and pandas data structures to R scripts.
- Fixed locking of working directory.
## [3.6.0] - 2016-03-10
### Added
- An onstart handler that allows adding code to be executed only before the actual workflow execution (not on dry-run).
- Parameters defined in the cluster config file are now accessible in the job properties under the key "cluster".
- The wrapper directive can be considered stable.
### Changed
- Allow using rule/job parameters with braces notation in cluster config.
- Show a proper error message in case of recursion errors.
- Remove non-empty temp dirs.
- Don't set the process group of Snakemake in order to allow kill signals from parent processes to be propagated.
- Fixed various corner case bugs.
- The params directive no longer converts a list ``l`` implicitly to ``" ".join(l)``.
## [3.5.5] - 2016-01-23
### Added
- New experimental wrapper directive, which allows referring to re-usable [wrapper scripts](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-wrappers). Wrappers are provided in the [Snakemake Wrapper Repository](https://bitbucket.org/snakemake/snakemake-wrappers).
- David Koppstein implemented two new command line options to constrain the execution of the DAG of jobs to sub-DAGs (--until and --omit-from).
### Changed
- Fixed various bugs, e.g. with shadow jobs and --latency-wait.
## [3.5.4] - 2015-12-04
### Changed
- The params directive now fully supports non-string parameters. Several bugs in the remote support were fixed.
## [3.5.3] - 2015-11-24
### Changed
- The missing remote module was added to the package.
## [3.5.2] - 2015-11-24
### Added
- Support for easy integration of external R and Python scripts via the new [script directive](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-external-scripts).
- Chris Tomkins-Tinch has implemented support for remote files: Snakemake can now handle input and output files from Amazon S3, Google Storage, FTP, SFTP, HTTP and Dropbox.
- Simon Ye has implemented support for sandboxing jobs with [shadow rules](https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-shadow-rules).
### Changed
- Manuel Holtgrewe has fixed dynamic output files in combination with multiple wildcards.
- It is now possible to add suffixes to all shell commands with shell.suffix("mysuffix").
- Job execution has been refactored to spawn processes only when necessary, resolving several problems in combination with huge workflows consisting of thousands of jobs and reducing the memory footprint.
- In order to reflect the new collaborative development model, Snakemake has moved from my personal bitbucket account to http://snakemake.bitbucket.org.
## [3.4.2] - 2015-09-12
### Changed
- Willem Ligtenberg has reduced the memory usage of Snakemake.
- Per Unneberg has improved config file handling to provide a more intuitive overwrite behavior.
- Simon Ye has improved the test suite of Snakemake and helped with setting up continuous integration via Codeship.
- The cluster implementation has been rewritten to use only a single thread to wait for jobs. This avoids failures with large numbers of jobs.
- Benchmarks are now writing tab-delimited text files instead of JSON.
- Snakemake now always requires the number of jobs to be set with -j when in cluster mode. Set this to a high value if your cluster does not have restrictions.
- The Snakemake Conda package has been moved to the bioconda channel.
- The handling of symlinks was improved, which made it necessary to raise the minimum required Python version to 3.3.
## [3.4.1] - 2015-08-05
### Changed
- This release fixes a bug that caused named input or output files to always be returned as lists instead of single files.
## [3.4] - 2015-07-18
### Added
- This release adds support for executing jobs on clusters in synchronous mode (e.g. qsub -sync). Thanks to David Alexander for implementing this.
- There is now vim syntax highlighting support (thanks to Jay Hesselberth).
- Snakemake is now available as Conda package.
### Changed
- Lots of bugs have been fixed. Thanks go to e.g. David Koppstein, Marcel Martin, John Huddleston and Tao Wen for helping with useful reports and debugging.
See [here](https://bitbucket.org/snakemake/snakemake/wiki/News-Archive) for older changes.
You likely also want to use google storage for reading and writing files.
For this, you will additionally need to authenticate with your google cloud account via
.. code-block:: console

    $ gcloud auth application-default login
This enables Snakemake to access Google Storage in order to check existence and modification dates of files.
Now, Snakemake is ready to use your cluster.
**Important:** After finishing your work, do not forget to delete the cluster with
...
...
To visualize the whole DAG regardless of the eventual presence of files, the ``--forceall`` flag can be used.
Of course the visual appearance can be modified by providing further command line arguments to ``dot``.
.. _cwl_export:
----------
CWL export
----------
Snakemake workflows can be exported to `CWL <http://www.commonwl.org/>`_, such that they can be executed in any `CWL-enabled workflow engine <https://www.commonwl.org/#Implementations>`_.
Since CWL is less powerful for expressing workflows than Snakemake (most importantly, Snakemake offers more flexible scatter-gather patterns, since full Python can be used), export works such that every Snakemake job is encoded into a single step in the CWL workflow.
Moreover, every step of that workflow calls Snakemake again to execute the job. The latter enables advanced Snakemake features like scripts, benchmarks and remote files to work inside CWL.
So, when exporting, keep in mind that the resulting CWL file can become huge, depending on the number of jobs in your workflow.
To export a Snakemake workflow to CWL, simply run
.. code-block:: console

    $ snakemake --export-cwl workflow.cwl
The resulting workflow will by default use the `Snakemake docker image <https://quay.io/repository/snakemake/snakemake>`_ for every step, but this behavior can be overridden via the CWL execution environment.
Then, the workflow can be executed in the same working directory with, e.g.,
.. code-block:: console

    $ cwltool workflow.cwl
Note that due to limitations in CWL, it currently seems impossible to avoid that all target files (output files of target jobs) are written directly to the workdir, regardless of their relative paths in the Snakefile.
Note that export is impossible in case the workflow contains :ref:`dynamic output files <snakefiles-dynamic_files>` or output files with absolute paths.
Workflows are described via a human-readable, Python-based language.
They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition.
Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
To get a first impression, see our `introductory slides <https://slides.com/johanneskoester/snakemake-short>`_.
.. _manual-quick_example:
...
...
Rules describe how to create **output files** from **input files**.
* Snakemake can automatically deploy required software dependencies of a workflow using `Conda <https://conda.io>`_ or `Singularity <http://singularity.lbl.gov/>`_.
* Snakemake can use Amazon S3, Google Storage, Dropbox, FTP, WebDAV, SFTP and iRODS to access input or output files and further access input files via HTTP and HTTPS (see the sketch below).
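For instance, remote input via the HTTP provider might look like the following sketch (the URL and paths are placeholders):

.. code-block:: python

    from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

    HTTP = HTTPRemoteProvider()

    rule download:
        input:
            # the remote file is downloaded on demand and used like a local file
            HTTP.remote("www.example.com/path/to/data.txt")
        output:
            "data/data.txt"
        shell:
            "cp {input} {output}"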
.. _main-getting-started:
...
...
Getting started
---------------
News about Snakemake is published via `Twitter <https://twitter.com/search?l=&q=%23snakemake%20from%3Ajohanneskoester>`_.
To get started, please do the :ref:`tutorial`, and see the :ref:`FAQ <project_info-faq>`.
.. _main-support:
...
...
@@ -121,12 +123,18 @@ Resources
Publications using Snakemake
----------------------------
In the following you find an **incomplete list** of publications making use of Snakemake for their analyses.
Please consider adding your own.
* Karlsson et al. 2018. `Four evolutionary trajectories underlie genetic intratumoral variation in childhood cancer <https://www.nature.com/articles/s41588-018-0131-y>`_. Nature Genetics.
* Planchard et al. 2018. `The translational landscape of Arabidopsis mitochondria <https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gky489/5033161>`_. Nucleic acids research.
* Schult et al. 2018. `Effect of UV irradiation on Sulfolobus acidocaldarius and involvement of the general transcription factor TFB3 in the early UV response <https://academic.oup.com/nar/article/46/14/7179/5047281>`_. Nucleic acids research.
* Goormaghtigh et al. 2018. `Reassessing the Role of Type II Toxin-Antitoxin Systems in Formation of Escherichia coli Type II Persister Cells <https://mbio.asm.org/content/mbio/9/3/e00640-18.full.pdf>`_. mBio.
* Ramirez et al. 2018. `Detecting macroecological patterns in bacterial communities across independent studies of global soils <https://www.nature.com/articles/s41564-017-0062-x>`_. Nature microbiology.
* Amato et al. 2018. `Evolutionary trends in host physiology outweigh dietary niche in structuring primate gut microbiomes <https://www.nature.com/articles/s41396-018-0175-0>`_. The ISME journal.
* Uhlitz et al. 2017. `An immediate–late gene expression module decodes ERK signal duration <http://msb.embopress.org/content/13/5/928>`_. Molecular Systems Biology.
* Akkouche et al. 2017. `Piwi Is Required during Drosophila Embryogenesis to License Dual-Strand piRNA Clusters for Transposon Repression in Adult Ovaries <http://www.sciencedirect.com/science/article/pii/S1097276517302071>`_. Molecular Cell.
* Beatty et al. 2017. `Giardia duodenalis induces pathogenic dysbiosis of human intestinal microbiota biofilms <https://www.ncbi.nlm.nih.gov/pubmed/28237889>`_. International Journal for Parasitology.
* Meyer et al. 2017. `Differential Gene Expression in the Human Brain Is Associated with Conserved, but Not Accelerated, Noncoding Sequences <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5400397/>`_. Molecular Biology and Evolution.
* Lonardo et al. 2017. `Priming of soil organic matter: Chemical structure of added compounds is more important than the energy content <http://www.sciencedirect.com/science/article/pii/S0038071716304539>`_. Soil Biology and Biochemistry.
* Beisser et al. 2017. `Comprehensive transcriptome analysis provides new insights into nutritional strategies and phylogenetic relationships of chrysophytes <https://peerj.com/articles/2832/>`_. PeerJ.
Snakemake itself is plain Python, hence the compiler error must come from one of the dependencies, like e.g., datrie.
You should check whether you are missing some library or a certain compiler package.
If everything seems fine, please report to the upstream developers of the failing dependency.
Note that in general it is recommended to install Snakemake via `Conda <https://conda.io>`_ which gives you precompiled packages and the additional benefit of having :ref:`automatic software deployment <integrated_package_management>` integrated into your workflow execution.
With Snakemake 5.1, it is possible to validate both types of configuration via `JSON schemas <http://json-schema.org>`_.
The function ``snakemake.utils.validate`` takes a loaded configuration (a config dictionary or a Pandas data frame) and validates it with a given JSON schema.
Thereby, the schema can be provided in JSON or YAML format. Also, by using the defaults property it is possible to populate entries with default values. See `jsonschema FAQ on setting default values <https://python-jsonschema.readthedocs.io/en/latest/faq/>`_ for details.
In case of the data frame, the schema should model the record that is expected in each row of the data frame.
In the following example,
...
...
the schema for validating the samples data frame looks like this:
  condition:
    type: string
    description: sample condition that will be compared during differential expression analysis (e.g. a treatment, a tissue time, a disease)
  case:
    type: boolean
    default: true
    description: boolean that indicates if sample is case or control
required:
  - sample
  - condition
Here, in case the case column is missing, the validate function will populate it with the default value ``true`` for each record.
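Putting this together, the call sites in the Snakefile might look like the following sketch (the schema and file paths are assumptions):

.. code-block:: python

    import pandas as pd
    from snakemake.utils import validate

    configfile: "config.yaml"
    # validate the config dictionary against a JSON/YAML schema
    validate(config, "schemas/config.schema.yaml")

    samples = pd.read_table(config["samples"]).set_index("sample", drop=False)
    # validate each row; a missing `case` column is populated
    # with the default value defined in the schema
    validate(samples, "schemas/samples.schema.yaml")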
with the following environment definition:
- r=3.3.1
- r-ggplot2=2.1.0
The path to the environment definition is interpreted as **relative to the Snakefile that contains the rule** (unless it is an absolute path, which is discouraged).
Snakemake will store the environment persistently in ``.snakemake/conda/$hash`` with ``$hash`` being the MD5 hash of the environment definition file content. This way, updates to the environment definition are automatically detected.
Note that you need to clean up environments manually for now. However, in many cases they are lightweight and consist of symlinks to your central conda installation.
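A minimal sketch of a rule using such an environment (the paths are assumptions):

.. code-block:: python

    rule plot:
        input:
            "results/data.csv"
        output:
            "plots/data.svg"
        conda:
            # interpreted relative to the Snakefile containing the rule
            "envs/plot.yaml"
        script:
            "scripts/plot.py"

The environment is only created and used when Snakemake is invoked with ``--use-conda``.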
Modularization in Snakemake comes at different levels.
1. The most fine-grained level are wrappers. They are available and can be published at the `Snakemake Wrapper Repository`_. These wrappers can then be composed and customized according to your needs, by copying skeleton rules into your workflow. In combination with conda integration, wrappers also automatically deploy the needed software dependencies into isolated environments.
2. For larger, reusable parts that shall be integrated into a common workflow, it is recommended to write small Snakefiles and include them into a master Snakefile via the include statement. In such a setup, all rules share a common config file.
3. The third level of separation is subworkflows. Importantly, these are rather meant as links between otherwise separate data analyses (see the sketch after this list).
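A sketch of levels 2 and 3 (the file and directory names are assumptions):

.. code-block:: python

    # level 2: include a small Snakefile into the master Snakefile
    include: "rules/mapping.smk"

    # level 3: link an otherwise separate analysis as a subworkflow
    subworkflow preprocessing:
        workdir:
            "../preprocessing"
        snakefile:
            "../preprocessing/Snakefile"

    rule all:
        input:
            # request a file that is created by the subworkflow
            preprocessing("results/table.csv")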
...
...
For example

    "0.0.8/bio/samtools_sort"
Refers to the wrapper ``"0.0.8/bio/samtools_sort"`` to create the output from the input.
Snakemake will automatically download the wrapper from the `Snakemake Wrapper Repository`_.
Thereby, 0.0.8 can be replaced with the git version tag you want to use, or a commit id (see `here <https://bitbucket.org/snakemake/snakemake-wrappers/commits>`_).
This ensures reproducibility since changes in the wrapper implementation won't be propagated automatically to your workflow.
Alternatively, e.g., for development, the wrapper directive can also point to full URLs, including URLs to local files with absolute paths ``file://`` or relative paths ``file:``.
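A complete rule using this wrapper might look as follows (the input/output paths, params, and thread count are assumptions):

.. code-block:: python

    rule samtools_sort:
        input:
            "mapped/{sample}.bam"
        output:
            "mapped/{sample}.sorted.bam"
        params:
            "-m 4G"
        threads: 8
        wrapper:
            "0.0.8/bio/samtools_sort"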
Examples for each wrapper can be found in the READMEs located in the wrapper subdirectories at the `Snakemake Wrapper Repository`_.
The `Snakemake Wrapper Repository`_ is meant as a collaborative project and pull requests are very welcome.
Shadow rules result in each execution of the rule to be run in isolated temporary directories.
By setting ``shadow: "shallow"``, the top level files and directories are symlinked, so that any relative paths in a subdirectory will be real paths in the filesystem. The setting ``shadow: "full"`` fully shadows the entire subdirectory structure of the current workdir. The setting ``shadow: "minimal"`` only symlinks the inputs to the rule. Once the rule successfully executes, the output file will be moved if necessary to the real path as indicated by ``output``.
Typically, you will not need to modify your rule for compatibility with ``shadow``, unless you reference parent directories relative to your workdir in a rule.
.. code-block:: python
...
...
Shadow directories are stored one per rule execution in ``.snakemake/shadow/``, and are cleared on successful execution. Consider running with the ``--cleanup-shadow`` argument every now and then to remove any remaining shadow directories from aborted jobs. The base shadow directory can be changed with the ``--shadow-prefix`` command line argument.
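A sketch of a shadow rule (the tool and its temp-file behavior are hypothetical):

.. code-block:: python

    rule messy_tool:
        input:
            "data/points.csv"
        output:
            "plots/points.svg"
        # only the inputs are symlinked into the shadow directory
        shadow: "minimal"
        shell:
            "plot-tool {input} --out {output}"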
We activate benchmarking for the rule ``bwa_map``:
The ``benchmark`` directive takes a string that points to the file where benchmarking results shall be stored.
Similar to output files, the path can contain wildcards (it must be the same wildcards as in the output files).
When a job derived from the rule is executed, Snakemake will measure the wall clock time and memory usage (in MiB) and store it in the file in tab-delimited format.
It is possible to repeat a benchmark multiple times in order to get a sense for the variability of the measurements.
This can be done by annotating the benchmark file: with ``repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)``, Snakemake can be told to run the job three times.
The repeated measurements occur as subsequent lines in the tab-delimited benchmark file.
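For instance, the ``bwa_map`` rule might then be annotated like this sketch (the repeat count of three is just an example, and the exact rule body may differ):

.. code:: python

    rule bwa_map:
        input:
            "data/genome.fa",
            "data/samples/{sample}.fastq"
        output:
            "mapped_reads/{sample}.bam"
        benchmark:
            # run the job three times; each run appends one line of measurements
            repeat("benchmarks/{sample}.bwa.benchmark.txt", 3)
        shell:
            "bwa mem {input} | samtools view -Sb - > {output}"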
Exercise
........
* Re-execute the workflow and benchmark ``bwa_map`` with 3 repeats. Open the report and see how the list of benchmark files is presented in the HTML report.
Modularization
::::::::::::::
...
...
Exercise
........
* Put the read mapping related rules into a separate Snakefile and use the ``include`` directive to make them available in our example workflow again.
Using custom scripts
::::::::::::::::::::
Usually, a workflow not only consists of invoking various tools, but also contains custom code to e.g. calculate summary statistics or create plots.
While Snakemake also allows you to directly :ref:`write Python code inside a rule <snakefiles-rules>`, it is usually reasonable to move such logic into separate scripts.
For this purpose, Snakemake offers the ``script`` directive.
Add the following rule to your Snakefile:
.. code:: python
    rule plot_quals:
        input:
            "calls/all.vcf"
        output:
            "plots/quals.svg"
        script:
            "scripts/plot-quals.py"
This rule shall generate a histogram of the quality scores that have been assigned to the variant calls in the file ``calls/all.vcf``.
The actual Python code to generate the plot is hidden in the script ``scripts/plot-quals.py``.
Script paths are always relative to the referring Snakefile.
In the script, all properties of the rule like ``input``, ``output``, ``wildcards``, etc. are available as attributes of a global ``snakemake`` object.
Create the file ``scripts/plot-quals.py`` with the following content:
.. code:: python

    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    from pysam import VariantFile

    quals = [record.qual for record in VariantFile(snakemake.input[0])]
    plt.hist(quals)
    plt.savefig(snakemake.output[0])
.. sidebar:: Note
report("""...""", output[0], **input)
It is best practice to use the script directive whenever an inline code block would have
more than a few lines of code.
Although there are other strategies to invoke separate scripts from your workflow
(e.g., invoking them via shell commands), the benefit of this is obvious:
the script logic is separated from the workflow logic (and can even be shared between workflows),
but **boilerplate code like the parsing of command line arguments is unnecessary**.
Apart from Python scripts, it is also possible to use R scripts. In R scripts,
an S4 object named ``snakemake`` analogous to the Python case above is available and
allows access to input and output files and other parameters. Here the syntax
follows that of S4 classes with attributes that are R lists, e.g. we can access
the first input file with ``snakemake@input[[1]]`` (note that the first file does
not have index 0 here, because R starts counting from 1). Named input and output
files can be accessed in the same way, by just providing the name instead of an
index, e.g. ``snakemake@input[["myfile"]]``.
report("""...""", output[0], metadata="Author: Johannes Köster (koester@jimmy.harvard.edu)", **input)
For details and examples, see the :ref:`snakefiles-external_scripts` section in the Documentation.
Step 7: Adding a target rule
...
...
Here, this means that we add a rule
    rule all:
        input:
            "plots/quals.svg"
to the top of our workflow.
When executing Snakemake with
...
...
Snakemake will execute the first rule per default; you can target any of them via
the command line (e.g., ``snakemake -n mytarget``).
the execution plan for creating the file ``plots/quals.svg`` which contains and summarizes all our results will be shown.
Note that, apart from Snakemake considering the first rule of the workflow as default target, **the appearance of rules in the Snakefile is arbitrary and does not influence the DAG of jobs**.
Exercise
........
* Create the DAG of jobs for the complete workflow.
* Execute the complete workflow and have a look at the resulting ``plots/quals.svg``.
* Snakemake provides handy flags for forcing re-execution of parts of the workflow. Have a look at the command line help with ``snakemake --help`` and search for the flag ``--forcerun``. Then, use this flag to re-execute the rule ``samtools_sort`` and see what happens.
* With ``--reason`` it is possible to display the execution reason for each job. Try this flag together with a dry-run and the ``--forcerun`` flag to understand the decisions of Snakemake.
...
...
In total, the resulting workflow looks like this:
"bcftools call -mv - > {output}"
rule report:
rule plot_quals:
input:
"calls/all.vcf"
output:
"report.html"
run:
from snakemake.utils import report
with open(input[0]) as vcf:
n_calls = sum(1 for l in vcf if not l.startswith("#"))
report("""
An example variant calling workflow
===================================
Reads were mapped to the Yeast
reference genome and variants were called jointly with
SAMtools/BCFtools.
This resulted in {n_calls} variants (see Table T1_).
To go through this tutorial, you need the following software installed:
* Python_ ≥3.5
* Snakemake_ 5.2.3
* BWA_ 0.7.12
* SAMtools_ 1.3.1
* Pysam_ 0.15.0
* BCFtools_ 1.3.1
* Graphviz_ 2.38.0
* PyYAML_ 3.11
* Docutils_ 0.12
* Jinja2_ 2.10
* NetworkX_ 2.1
* Matplotlib_ 2.2.3
The easiest way to set up these prerequisites is to use the Miniconda_ Python 3 distribution.
The tutorial assumes that you are using either Linux or MacOS X.
...
...
We will later use Conda_ to create an isolated environment with all required software.
Step 2: Preparing a working directory
:::::::::::::::::::::::::::::::::::::
First, **create a new directory** ``snakemake-tutorial`` at a **reasonable place** and change into that directory in your terminal:
.. code:: console

    $ mkdir snakemake-tutorial
    $ cd snakemake-tutorial
If you use a Vagrant Linux VM from Windows as described above, create that directory under ``/vagrant/``, so that the contents are shared with your host system (you can then edit all files from within Windows with an editor that supports Unix line breaks).
In this directory, we will later create an example workflow that illustrates the Snakemake syntax and execution environment.
...
...
This tutorial introduces the text-based workflow system Snakemake_.
Snakemake follows the `GNU Make`_ paradigm: workflows are defined in terms of rules that define how to create output files from input files.
Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.
Snakemake sets itself apart from other text-based workflow systems in the following way.
Hooking into the Python interpreter, Snakemake offers a definition language that is an extension of Python_ with syntax to define rules and workflow specific properties.
This allows combining the flexibility of a plain scripting language with a pythonic workflow definition.
The Python language is known to be concise yet readable and can appear almost like pseudo-code.
The syntactic extensions provided by Snakemake maintain this property for the definition of the workflow.
Further, Snakemake's scheduling algorithm can be constrained by priorities, provided cores, and customizable resources, and it provides generic support for distributed computing (e.g., cluster or batch systems).
Hence, a Snakemake workflow scales without modification from single core workstations and multi-core servers to cluster or batch systems.
Finally, Snakemake integrates with the package manager Conda_ and the container engine Singularity_ such that defining the software stack becomes part of the workflow itself.
The examples presented in this tutorial come from bioinformatics.
However, Snakemake is a general-purpose workflow management system for any discipline.