Low-level Nextflow Hacking

I recently created a proof-of-principle for a deeper integration of Jupyter notebooks with Nextflow and started implementing my first modules in the new Nextflow DSL2. While doing so, I learned a lot about Groovy and about Nextflow itself.

A small bug: a dung beetle, captured in Piano Battaglia, Sicily, Italy

1. Closures

Closures are omnipresent in Nextflow pipelines, and I have used them many times without realizing it. For instance, in

awesome_channel.map { it -> it + 1 }

the expression { it -> it + 1 } is a closure.

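Closures resemble the anonymous functions I know from Python, but the two examples below behave differently. In Python, assigning to a variable inside a function creates a new local variable, so a lambda defined at module level keeps seeing the module-level value. In a Groovy script, a variable assigned without def lives in the script's shared binding, and a closure looks it up there when it runs, so it sees whatever value was assigned last, even if that assignment happened inside another method.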

In the following Python code, inc uses a from the outer scope, because this is where it got defined:

Input:

a = 41

inc = lambda _: a + 1

def main():
    a = 0
    print(inc(None))

main()

Output:

42

However, in the following Groovy code, inc sees the value assigned inside main, because a is resolved when the closure gets executed:

Input:

a = 41

inc = { a + 1}

def main() {
    a = 0
    println inc()
}

main()

Output:

1
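To make the comparison more concrete, the Python example can be pushed towards the Groovy behaviour. This is only a sketch: the function names main_local and main_global are made up for illustration, and the global statement is used as a rough stand-in for an assignment without def in a Groovy script.

```python
a = 41

# The lambda does not copy `a`; it looks the name up each time it is called.
inc = lambda: a + 1


def main_local():
    a = 0  # creates a new local variable; the module-level `a` is untouched
    return inc()


def main_global():
    global a
    a = 0  # rebinds the module-level `a`, roughly like `a = 0` without `def` in a Groovy script
    return inc()


result_local = main_local()
result_global = main_global()
print(result_local)   # 42
print(result_global)  # 1
```

The second call returns 1 because the assignment lands on the same variable that inc reads, which is the situation the Groovy script is in.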

2. Using a dynamic `tag` value

In nf-core modules, the sample id is used as the tag of each process so that it shows up nicely in the log. The problem is that the tag directive gets evaluated before the variables from the input channels become available.

Therefore, the following snippet does not work:

Input:

nextflow.enable.dsl = 2                                                         
                                                                                
process foo {                                                                   
    tag id     

    //It does not work either to use `meta.id` directly (without `exec`)
    //tag meta.id                                                          
                                                                                
    input:                                                                      
    val meta                                                                    
                                                                                
    exec:                                                                     
    id = meta.id                                                                
}                                                                               
                                                                                
workflow {                                                                      
    foo([id: 'test'])                                                           
}

Output:

No such variable: id

To work around this, we can use a closure (tag { id }) to defer the evaluation until the process actually gets executed and the values become available:

Input:

nextflow.enable.dsl = 2                                                         
                                                                                
process foo {                                                                   
    tag { id }                                                                  
                                                                                
    input:                                                                      
    val meta                                                                    
                                                                                
    exec:                                                                     
    id = meta.id                                                                
}                                                                               
                                                                                
workflow {                                                                      
    foo([id: 'test'])                                                           
}

Output:

[07/76cf28] process > foo (test) [100%] 1 of 1 ✔

An intriguing detail is that by using Groovy’s string interpolation, we can successfully use tag "${meta.id}", but not tag "${id}". Apparently, the string interpolation defers the evaluation until the values from the input channels are available, but not until the code in the exec section has run.
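The difference between an eagerly evaluated tag and a deferred one can be imitated in plain Python. This is an analogy only, not how Nextflow is implemented: run_task, the context dictionary and the name sample_id are all hypothetical and exist just for this sketch.

```python
def run_task(tag, meta):
    """Hypothetical stand-in for task execution: build the context, then resolve the tag."""
    # These values only exist once the task actually runs.
    context = {"meta": meta, "id": meta["id"]}
    # A callable tag is evaluated now, against the run-time context.
    return tag(context) if callable(tag) else tag


# Eager: the expression is evaluated while the "process" is still being defined,
# so the name cannot be resolved yet.
eager_error = None
try:
    eager_tag = sample_id  # deliberately undefined at definition time
except NameError as err:
    eager_error = str(err)

# Deferred: wrapping the expression in a function postpones the lookup.
deferred_tag = lambda ctx: ctx["id"]
resolved = run_task(deferred_tag, {"id": "test"})

print(eager_error)
print(resolved)
```

The lambda plays the role of tag { id }: nothing inside it is looked up until run_task has built the context.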

3. The Nextflow “Script” class and the implicit variable “this”

Internally, for Nextflow, each .nf file is an instance of the class Script. From within the file it can be accessed as the implicit variable this.

Its variables can be retrieved using this.binding.variables, which contains, amongst others:

* the params,
* Nextflow implicit variables such as baseDir and workDir, and
* variables globally defined in a .nf file (→ script_var in the example).

Note that variables declared using def are not accessible through this (→ other_var in the example).

Input:

params.bar = "test"                                                             
script_var = 42                                                                 
def other_var = 1                                                     
                                                                                
process foo {                                                                   
    exec:                                                                       
    println this.binding.variables                                              
}                                                                               
                                                                                
workflow {                                                                      
    foo()                                                                       
}

Output:

[
    args:[], params:[bar:test], baseDir:/scratch/nf-test,
    projectDir:/scratch/nf-test, workDir:/scratch/nf-test/work, 
    workflow:repository: null, projectDir: /scratch/nf-test,
    commitId: null, revision: null,
    startTime: 2020-11-28T16:47:44.162643+01:00, endTime: null,
    duration: null, container: {}, commandLine: nextflow ./test.nf,
    nextflow: nextflow.NextflowMeta([...]), success: false,
    workDir: /scratch/nf-test/work, launchDir: /scratch/nf-test,
    profile: standard, nextflow:nextflow.NextflowMeta([...]),
    launchDir:/scratch/nf-test, moduleDir:/scratch/nf-test,
    script_var:42
]
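Why script_var shows up while other_var does not can be pictured with a small Python model. This is a loose analogy under my own assumptions: the binding is represented by a plain dictionary, and run_script is a made-up stand-in for the script body.

```python
def run_script(binding):
    """Hypothetical stand-in for the body of a script that receives a binding."""
    # An assignment without `def` is stored in the shared binding...
    binding["script_var"] = 42
    # ...while a `def` variable is an ordinary local and never reaches the binding.
    other_var = 1
    return sorted(binding)


binding = {"params": {"bar": "test"}}
visible = run_script(binding)
print(visible)  # ['params', 'script_var']
```

Anything that later inspects the binding, such as the process in the example above, can only see what was stored in it.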

4. Nextflow’s process representation and the “task” implicit variable

I already used the expression task.cpus many times in my nextflow pipelines, without questioning how it works.

When a process gets executed in Nextflow, it is represented by an instance of the class TaskRun, which holds a config object of type TaskConfig. The latter is accessible through the implicit variable task.

It contains all variables defined within a process; cpus just happens to be one of them. Additionally, we can access all input variables via task.binding. Again, variables declared with def cannot be accessed.

Input:

process foo {                                                                   
    process_var = 42                                                            
    def other_var = 42                                                          
    cpus = 2                                                                    
                                                                                
    input:                                                                      
    val(meta)                                                                   
                                                                                
    exec:                                                                       
    id = meta.id                                                                
    def local_var = 1                                                           
    println "task = ${task}"                                                    
    println "task.binding = ${task.binding}"                                    
}                                                                               
                                                                                
workflow {                                                                      
    foo([id: "test"])                                                           
}

Output:

task = [
    process_var:42, process:foo, cpus:2, index:1, echo:false,
    validExitStatus:[0], maxRetries:0, maxErrors:-1, 
    shell:[/bin/bash, -ue], executor:local, name:foo, 
    cacheable:true, errorStrategy:TERMINATE,
    workDir:/scratch/nf-test/work/7a/605dabf4d1b454c288434b6381622c, 
    hash:7a605dabf4d1b454c288434b6381622c
]

task.binding = [meta:[id:test], $:true, task:[...], id:test]

5. Debugging Nextflow

I found that Nextflow’s fancy ANSI-logging feature sometimes swallows println statements or error messages. It is possible to turn it off using nextflow run -ansi-log false. I also learned to check the .nextflow.log file more often. It sometimes contains helpful additional information, such as full Java stack traces.
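When the log file gets long, a few lines of Python can pull out the interesting parts. This is a minimal sketch: extract_trace_lines is a hypothetical helper, the sample text is made up and does not reproduce the real .nextflow.log layout, and the only format it relies on is the usual Java stack-trace convention of "at ..." and "Caused by:" lines.

```python
def extract_trace_lines(log_text):
    """Return lines that look like errors or Java stack-trace frames."""
    keep = []
    for line in log_text.splitlines():
        stripped = line.strip()
        is_error = "ERROR" in line or "Exception" in line
        is_frame = stripped.startswith(("at ", "Caused by:"))
        if is_error or is_frame:
            keep.append(stripped)
    return keep


# Made-up excerpt for illustration only; real log content will differ.
sample_log = """\
INFO  starting run
ERROR something failed
java.lang.IllegalStateException: boom
\tat example.Foo.bar(Foo.java:42)
Caused by: java.io.IOException: disk
"""

hits = extract_trace_lines(sample_log)
for hit in hits:
    print(hit)
```

For a real run, the text could be read with pathlib.Path(".nextflow.log").read_text() and passed to the same function.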
