Quick introduction to GWF

Install

Documentation: http://mailund.github.io/gwf/

pip install https://github.com/mailund/gwf/archive/develop.zip

Functions

  1. specify a workflow
  2. submit jobs
  3. check job status

A very easy example

from gwf import *

target('UnZipGenome', input='ponAbe2.fa.gz', output='ponAbe2.fa') << '''
gzcat ponAbe2.fa.gz > ponAbe2.fa
'''

We only need to specify the tasks in workflow.py:

from gwf import *

target('UnZipGenome', input='ponAbe2.fa.gz', output='ponAbe2.fa') << '''
gzcat ponAbe2.fa.gz > ponAbe2.fa
'''


target('IndexGenome', input='ponAbe2.fa',
       output=['ponAbe2.amb', 'ponAbe2.ann', 'ponAbe2.pac']) << '''
bwa index -p ponAbe2 -a bwtsw ponAbe2.fa
'''


target('MapReads',
       input=['ponAbe2.fa',
              'ponAbe2.amb', 'ponAbe2.ann', 'ponAbe2.pac',
              'Masala_R1.fastq.gz', 'Masala_R2.fastq.gz'],
       output='Masala.sorted.rmdup.bam') << '''

bwa mem -t 16 ponAbe2.fa Masala_R1.fastq.gz Masala_R2.fastq.gz | \
samtools view -Shb - > /scratch/$GWF_JOBID/unsorted.bam

samtools sort -o /scratch/$GWF_JOBID/unsorted.bam /scratch/$GWF_JOBID/sort | \
samtools rmdup -s - Masala.sorted.rmdup.bam
'''
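
The three targets above are linked only through their input and output file names. As a minimal sketch of how a run order could be derived from that information alone (plain Python for illustration, not gwf's actual scheduler; the `targets` dictionary and the `run_order` function are hypothetical, and the sketch does not detect cyclic dependencies):

```python
# Hypothetical, simplified model of the workflow above: each target maps to
# the files it reads and the files it writes.
targets = {
    'MapReads': {
        'input': ['ponAbe2.fa', 'ponAbe2.amb', 'ponAbe2.ann', 'ponAbe2.pac',
                  'Masala_R1.fastq.gz', 'Masala_R2.fastq.gz'],
        'output': ['Masala.sorted.rmdup.bam'],
    },
    'IndexGenome': {
        'input': ['ponAbe2.fa'],
        'output': ['ponAbe2.amb', 'ponAbe2.ann', 'ponAbe2.pac'],
    },
    'UnZipGenome': {
        'input': ['ponAbe2.fa.gz'],
        'output': ['ponAbe2.fa'],
    },
}


def run_order(targets):
    """Return target names so that every producer precedes its consumers."""
    # Map each output file to the target that produces it.
    producer = {f: name for name, t in targets.items() for f in t['output']}
    order, visited = [], set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for f in targets[name]['input']:
            # Files that no target produces are assumed to exist already.
            if f in producer:
                visit(producer[f])
        order.append(name)

    for name in targets:
        visit(name)
    return order


print(run_order(targets))
```

Even though MapReads is listed first in the dictionary, it ends up last in the order because it consumes files written by the other two targets.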

Templates

Templates make it easier to write calls to the target function, and they can
be reused across projects (much as you would reuse a LaTeX template).

Let's rewrite the easy example with a template.

from gwf import *

# This is the template part
unzip = template(input='{refGenome}.fa.gz', output='{refGenome}.fa') << '''
gzcat {refGenome}.fa.gz > {refGenome}.fa
'''


# This is the workflow part
target('UnZipGenome') << unzip(refGenome='ponAbe2')
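
Conceptually, calling a template fills every {placeholder} in its options and in its shell command with the supplied value. A minimal sketch of that idea using Python's str.format follows; the `expand` helper is hypothetical and is not necessarily how gwf implements templates.

```python
def expand(options, shell_spec, **values):
    """Fill {placeholders} in the option strings and in the shell command."""
    filled = {key: text.format(**values) for key, text in options.items()}
    return filled, shell_spec.format(**values)


options, command = expand(
    {'input': '{refGenome}.fa.gz', 'output': '{refGenome}.fa'},
    'gzcat {refGenome}.fa.gz > {refGenome}.fa',
    refGenome='ponAbe2',
)
print(options)   # the input/output file names for this genome
print(command)   # the shell command with the genome name filled in
```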

What's more, we can also generate templates programmatically. In fact, any
function that returns a dictionary of options (options) and a shell command
string (shell_spec) can serve as a template generator.

def merge(individual, inputfiles):
    outputfile = '{}.bam'.format(individual)
    options = {'input': inputfiles, 'output': outputfile}
    shell_spec = '''

samtools merge - {inputbams} | samtools rmdup -s - {name}.bam

'''.format(inputbams=' '.join(inputfiles), name=individual)

    return (options, shell_spec)
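
To see what such a generator hands back, it can be called directly in plain Python, without gwf. The sketch below repeats the merge function in compact form so that it runs on its own; the input BAM file names are made up for illustration.

```python
def merge(individual, inputfiles):
    options = {'input': inputfiles, 'output': '{}.bam'.format(individual)}
    shell_spec = 'samtools merge - {inputbams} | samtools rmdup -s - {name}.bam'.format(
        inputbams=' '.join(inputfiles), name=individual)
    return options, shell_spec


# Hypothetical input files for one individual.
options, shell_spec = merge('Masala', ['Masala_lane1.bam', 'Masala_lane2.bam'])
print(options['output'])
print(shell_spec)
```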

Some useful functions

The glob and shell functions

files = glob("*")
the_same_files = shell("ls")
regions = glob("data/covered_region/*")

# shell() always returns a list, so take the first element
cwd = shell('pwd')[0]
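
For readers who want similar behaviour outside a workflow file, a rough standard-library equivalent is sketched below. `shell_lines` is a hypothetical stand-in for gwf's shell, assumed here to run a command and return its output split into lines; gwf's own implementation may differ. The sketch assumes a POSIX shell is available.

```python
import glob
import subprocess


def shell_lines(command):
    """Run a shell command and return its standard output as a list of lines."""
    output = subprocess.check_output(command, shell=True, universal_newlines=True)
    return output.splitlines()


files = sorted(glob.glob('*'))   # glob.glob also returns a list
cwd = shell_lines('pwd')[0]      # a one-line result is still a list
print(files)
print(cwd)
```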

Overriding options: nodes, cores, memory, walltime…

sort_template = template(input='{annotation}', output='{annotation}.sorted') << '''
sort -k 1,1 -k 2n,2 {annotation} -o {annotation}.sorted
'''


# Request more memory for a very large file (the default is 2g).
target('sort', memory='8g') << sort_template()

Infix operators

The nested call f(g(x, y), z) can be written as:

x<<g>>y<<f>>z

This is very useful when dealing with strings (file names, directories…).
The three basic operators are tag, suffix and outdir:

x = ["qux/foo.a", "bar.b"]
x <<outdir>> "qax" <<tag>> "qux" <<suffix>> ".c"

['qax/qux_foo.c', 'qax/qux_bar.c']
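
Python has no user-defined infix operators, so a syntax like x <<op>> y is usually built by overloading << and >> on a wrapper object. The sketch below shows one common way to do this and reproduces the result above. The `Infix` and `_Bound` classes and these definitions of outdir, tag and suffix are illustrative assumptions, not gwf's own code.

```python
import posixpath


class _Bound:
    """An operator that has already received its left operand."""

    def __init__(self, func, left):
        self.func = func
        self.left = left

    def __rshift__(self, right):
        # Handles  (left << op) >> right : apply the wrapped function.
        return self.func(self.left, right)


class Infix:
    """Wrap a two-argument function so it can be written as  x <<op>> y."""

    def __init__(self, func):
        self.func = func

    def __rlshift__(self, left):
        # Handles  left << op : remember the left operand.
        return _Bound(self.func, left)


@Infix
def outdir(paths, directory):
    # Move every file into the given directory.
    return [posixpath.join(directory, posixpath.basename(p)) for p in paths]


@Infix
def tag(paths, prefix):
    # Prepend "<prefix>_" to every file name, keeping its directory.
    return [posixpath.join(posixpath.dirname(p), prefix + '_' + posixpath.basename(p))
            for p in paths]


@Infix
def suffix(paths, extension):
    # Replace the extension of every file name.
    return [posixpath.splitext(p)[0] + extension for p in paths]


x = ["qux/foo.a", "bar.b"]
result = x <<outdir>> "qax" <<tag>> "qux" <<suffix>> ".c"
print(result)
```

Because << and >> share the same precedence and associate left to right, the chain is evaluated one operator at a time, with each result becoming the left operand of the next.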

Check job status

gwf
gwf --status

This is more user-friendly than ‘mj’, and logs are generated automatically.