Summary and Schedule

A supplementary lesson for users of the Snakemake workflow system that covers how to share your workflows. This is not just a matter of uploading code, but a whole way of thinking about the workflow as a re-usable entity. Tools and approaches that lead to re-usable code are introduced with practical examples, and the WorkflowHub.eu site is presented as a publicly-funded and standards-compliant repository for shared workflows.

This lesson is designed to be taught in around one day, and follows on from the material in the Snakemake for Bioinformatics episodes.

Learner Prerequisites

If learners have not worked through the introductory Snakemake for Bioinformatics course, then they should already be familiar with writing pipelines in Snakemake, for example:

Writing rules and linking them via input/output filenames
Visualising the DAG generated by Snakemake (--dag option)
Using --configfile and config items with Snakemake workflows
Specifying Conda environments for Snakemake rules

Learners should also follow the set-up instructions for the introductory course, in order to have the full software environment.

Notes

WorkflowHub.eu is a FAIR registry for describing, sharing and publishing scientific computational workflows. The registry is sponsored by the European RI Cluster EOSC-Life, the European Research Infrastructure ELIXIR, and multiple EU-wide projects.

This lesson was built with The Carpentries Workbench.

Setup Instructions

Download files required for the lesson

00h 00m

1. Making a toy dataset

What do we mean by testing in the context of workflows?
What makes a suitable dataset for testing?
How do we go about designing tests?

01h 00m

2. Separating code and configuration

What needs to be configurable in our workflow?
How do we express these settings using the Snakemake config mechanism?
How does this fit in with testing?

02h 00m

3. Using standard file locations

What tools exist to check that the workflow files are all present and correct?

03h 00m

4. Source code control

What are the key ideas of source control?
What are the essential Git commands we need to manage our simple code?
How can Git tags help to manage our code updates?

04h 00m

5. Adding a license

What OSI licenses are available?
How do I record ownership and authorship of my code?

05h 00m

6. Upload to WorkflowHub

Why should I use WorkflowHub?
How does WorkflowHub compare with the Snakemake Workflow Catalogue?

06h 00m

7. RO-Crates

What are RO-Crates in general?
What are the specific types of RO-Crate?
What can I do with an RO-Crate?

07h 00m

8. Releasing a new version

What should I do to update the workflow?

08h 00m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Data Sets

We’ll need to start with a simple example Snakefile, which in this case is the sample answer given to the sequence assembly challenge.

Download the Snakefile and save it.

If you do not have the yeast files from the intro course, unpack the sample dataset tarball from https://figshare.com/ndownloader/files/42467370

You may do this in the shell with the command:

BASH

$ wget --content-disposition https://figshare.com/ndownloader/files/42467370

The tar file needs to be unpacked to yield the directory of files used in the course. In the shell you may do this with:

BASH

$ tar -xvaf data-for-snakemake-novice-bioinformatics.tar.xz

You will also need to rename the files as is normally done at the start of episode 06.

BASH

$ cd yeast/reads
$ rename -v -s ref ref_ *
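
The `rename` utility exists in several incompatible variants, so the command above may not behave the same way on every system. As a fallback, the same change can be sketched with a plain shell loop. This is an illustrative alternative rather than part of the lesson itself: `rename_ref_files` is a hypothetical helper name, and the sketch assumes that the files to be changed all have names starting with `ref`.

```shell
# Hypothetical helper (not part of the lesson): give every file named
# ref* in a directory the prefix ref_ instead, e.g. ref1_1.fq -> ref_1_1.fq
rename_ref_files() {
    dir="${1:-.}"                      # directory to work in, default: current
    for f in "$dir"/ref*; do
        [ -e "$f" ] || continue        # nothing matched, glob left as literal
        name=$(basename "$f")
        case "$name" in
            ref_*) continue ;;         # already has the underscore, skip it
        esac
        # ${name#ref} strips the leading "ref" so that "ref_" can be prepended
        mv -v -- "$f" "$dir/ref_${name#ref}"
    done
}
```

After defining the function in your shell, run `rename_ref_files yeast/reads` from the directory where the tarball was unpacked. Files that do not start with `ref` are left untouched, and running the function a second time makes no further changes.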

See this link for details about this dataset and the redistribution license.

Software Setup

Details

You will need the Snakemake software installed and working with conda. One way to do this is to follow the same setup instructions as for the Snakemake for Bioinformatics lesson.

Windows

This lesson has not yet been tested on Windows. You may use a WSL Linux environment, or else connect to a Linux system.

MacOS

Use Terminal.app

Linux

No platform-specific steps are needed.