Low-level Nextflow Hacking
I recently created a proof-of-principle for a deeper integration of Jupyter notebooks with Nextflow and started implementing my first modules in the new Nextflow DSL2. While doing so, I learned a lot about Groovy and Nextflow itself.
A small bug: A dung beetle, captured in Piano Battaglia, Sicily, Italy
1. Closures
Closures are omnipresent in Nextflow pipelines, and I have used them many times without knowing. For instance, in
awesome_channel.map { it -> it + 1 }
the expression { it -> it + 1 } is a closure.
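Closures look like the anonymous functions I know from Python, but they behaved differently from what I expected. In both languages, a free variable is looked up when the function runs. The difference lies in the assignment: in Python, assigning to a name inside a function creates a new local variable, whereas in a Groovy script, assigning to a name without def writes to the script-wide binding, which is also where the closure looks the variable up.
In the following Python code, inc uses the a from the outer scope, because the assignment inside main only creates a local variable that inc never sees: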
Input:
a = 41
inc = lambda _: a + 1

def main():
    a = 0
    print(inc(None))

main()
Output:
42
However, in the following Groovy code, inc sees the value that main assigned to a, because a is looked up when the closure gets executed:
Input:
a = 41
inc = { a + 1 }

def main() {
    a = 0
    println inc()
}

main()
Output:
1
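To check whether the difference comes from the lambda itself or from how the assignment is scoped, here is a small Python sketch. It assumes that the Groovy script above treats a as one shared, script-wide variable; main_local and main_global are names made up for this illustration.

```python
a = 41
inc = lambda: a + 1  # `a` is looked up each time inc is called


def main_local():
    a = 0  # creates a new local variable; the module-level `a` is untouched
    return inc()


def main_global():
    global a
    a = 0  # rebinds the module-level `a` that inc reads
    return inc()


result_local = main_local()
result_global = main_global()
print(result_local)   # 42
print(result_global)  # 1
```

With the global declaration, Python prints 1, just like the Groovy script. This suggests that the surprising part is where the assignment lands, not when the function body is evaluated.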
2. Using a dynamic tag value
In nf-core modules, the sample id is used as the tag of each process, so that it shows up nicely in the log. The problem is that the tag directive gets evaluated before the variables from the input channels become available.
Therefore, the following snippet does not work:
Input:
nextflow.enable.dsl = 2

process foo {
    tag id
    // It does not work either to use `meta.id` directly (without `exec`)
    // tag meta.id

    input:
    val meta

    exec:
    id = meta.id
}

workflow {
    foo([id: 'test'])
}
Output:
No such variable: id
To work around this, we can use a closure (tag { id }) to defer the evaluation until the process actually gets executed and the values become available:
Input:
nextflow.enable.dsl = 2

process foo {
    tag { id }

    input:
    val meta

    exec:
    id = meta.id
}

workflow {
    foo([id: 'test'])
}
Output:
[07/76cf28] process > foo (test) [100%] 1 of 1 ✔
An intriguing detail is that by using Groovy’s string interpolation, we can successfully use tag "${meta.id}", but not tag "${id}". Apparently, the string interpolation defers the evaluation until the input channels are available, but not until the code in the exec section has been executed.
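The eager-versus-deferred distinction can be mimicked in plain Python. This is only an analogy under my own assumptions, not how Nextflow implements directives: run_process, its context dictionary, and sample_id are hypothetical names invented for the sketch.

```python
def run_process(tag, inputs):
    """Hypothetical stand-in for a process run (not Nextflow's API)."""
    context = dict(inputs)
    # Mimic the exec block: derive `id` from the input value.
    context["id"] = context["meta"]["id"]
    # A callable tag is resolved only now, after the values exist.
    label = tag(context) if callable(tag) else tag
    return f"foo ({label})"


inputs = {"meta": {"id": "test"}}

# Eager: the argument is evaluated before run_process is even called,
# and `sample_id` does not exist at that point.
try:
    run_process(sample_id, inputs)
    eager_error = None
except NameError as err:
    eager_error = str(err)

# Deferred: the lambda body only runs inside run_process.
lazy_label = run_process(lambda ctx: ctx["id"], inputs)

print(eager_error)
print(lazy_label)  # foo (test)
```

Wrapping the expression in a function delays the lookup until the caller decides to evaluate it, which is the role the closure plays in tag { id }.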
3. The Nextflow “Script” class and the implicit variable “this”
Internally, for Nextflow, each .nf file is an instance of the class Script. From within the file, it can be accessed as the implicit variable this.
Its variables can be retrieved using this.binding.variables, containing, amongst others:
* the params,
* Nextflow implicit variables such as baseDir and workDir, and
* variables globally defined in a .nf file (→ script_var in the example).
Note that variables declared using def are not accessible through this (→ other_var in the example).
Input:
params.bar = "test"
script_var = 42
def other_var = 1

process foo {
    exec:
    println this.binding.variables
}

workflow {
    foo()
}
Output:
[
args:[], params:[bar:test], baseDir:/scratch/nf-test,
projectDir:/scratch/nf-test, workDir:/scratch/nf-test/work,
workflow:repository: null, projectDir: /scratch/nf-test,
commitId: null, revision: null,
startTime: 2020-11-28T16:47:44.162643+01:00, endTime: null,
duration: null, container: {}, commandLine: nextflow ./test.nf,
nextflow: nextflow.NextflowMeta([...]), success: false,
workDir: /scratch/nf-test/work, launchDir: /scratch/nf-test,
profile: standard, nextflow:nextflow.NextflowMeta([...]),
launchDir:/scratch/nf-test, moduleDir:/scratch/nf-test,
script_var:42
]
4. Nextflow’s process representation and the “task” implicit variable
I already used the expression task.cpus many times in my Nextflow pipelines, without questioning how it works.
When a process gets executed in Nextflow, it is represented by an instance of the class TaskRun, which holds a config object of the class TaskConfig. The latter is accessible through the implicit variable task.
It contains all variables defined within a process; cpus just happens to be one of them. Additionally, we can access all input variables via task.binding. Again, variables declared with def cannot be accessed.
Input:
process foo {
    process_var = 42
    def other_var = 42
    cpus = 2

    input:
    val(meta)

    exec:
    id = meta.id
    def local_var = 1
    println "task = ${task}"
    println "task.binding = ${task.binding}"
}

workflow {
    foo([id: "test"])
}
Output:
task = [
process_var:42, process:foo, cpus:2, index:1, echo:false,
validExitStatus:[0], maxRetries:0, maxErrors:-1,
shell:[/bin/bash, -ue], executor:local, name:foo,
cacheable:true, errorStrategy:TERMINATE,
workDir:/scratch/nf-test/work/7a/605dabf4d1b454c288434b6381622c,
hash:7a605dabf4d1b454c288434b6381622c
]
task.binding = [meta:[id:test], $:true, task:[...], id:test]
5. Debugging Nextflow
I found that Nextflow’s fancy ANSI-logging feature sometimes swallows println statements or error messages. It is possible to turn it off using nextflow run -ansi-log false. I also learned to check the .nextflow.log file more often. It sometimes contains helpful additional information, such as full Java stack traces.
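Since the log file can get long, a small helper can pull out the interesting lines. The sketch below is a rough filter under my own assumptions: that error lines contain the word ERROR surrounded by spaces, and that stack-trace frames start with "at" or "Caused by:" as in usual Java traces. The function name extract_errors is made up for this example.

```python
from pathlib import Path


def extract_errors(log_text):
    """Return lines that look like errors or Java stack-trace frames."""
    keep = []
    for line in log_text.splitlines():
        stripped = line.strip()
        # Assumption: stack-trace frames look like "at pkg.Class.method(...)".
        is_frame = stripped.startswith("at ") or stripped.startswith("Caused by:")
        if " ERROR " in line or "Exception" in line or is_frame:
            keep.append(line)
    return keep


log_file = Path(".nextflow.log")
if log_file.exists():
    # Print only the suspicious lines of the log in the current directory.
    print("\n".join(extract_errors(log_file.read_text())))
```

Run from the directory where the pipeline was launched, this prints only the lines that match. The heuristics are deliberately loose and may need adjusting to what your log actually contains.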