Using RegEx to capture file name in Groovy/NextFlow

My sequencing files are named according to the following pattern: lane5651_AAGAGGCA_00h_Cell_WT3_L008_R1.fastq.gz. I would like to capture the sample ID, here 00h_Cell_WT3, so that all downstream files can be named accordingly. To this end, I wrote the following snippet:

#!/usr/bin/env nextflow

// fastq files are stored in reads as paired ends R1 and R4
params.reads = 'reads/lane*_*_*_*_R{1,4}.fastq.gz'

Channel
     .fromFilePairs(params.reads, flat: true)
     .map { prefix, file1, file2 -> tuple(getSampleID(prefix), file1, file2) }
     .set { samples_ch }

def getSampleID( file ) {
    // use a regex to extract the sample ID
    // for paired ends, fromFilePairs (with flat: true) returns a triplet
    //     whose first item is the file name without `R{1,4}.fastq.gz`,
    //     thus the regex needs to be adjusted as follows
    def regexpPE = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(.+)_(L[0-9]{3})/
    // group 3 of the first match is the sample ID, e.g. 00h_Cell_WT3
    (file =~ regexpPE)[0][3]
}

process printNames {
    input:
    set sampleId, forward_reads, reverse_reads from samples_ch

    output:
    stdout result

    """
    echo $sampleId 'and' $forward_reads 'and' $reverse_reads
    """
}

result.subscribe { println it }
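
The pattern used in getSampleID can be tried outside Nextflow in plain Groovy before running the pipeline. The following is only a minimal sketch: the helper name extractSampleId is hypothetical, and it assumes the prefix handed over by fromFilePairs has the form lane5651_AAGAGGCA_00h_Cell_WT3_L008, as the comments above describe. Unlike the original, it returns null for names that do not match, rather than failing when indexing an empty matcher.

```groovy
// Hypothetical helper, same pattern as getSampleID above.
def extractSampleId(String name) {
    def pattern = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(.+)_(L[0-9]{3})/
    def m = (name =~ pattern)
    // find() is false when nothing matches; return null in that case
    // instead of indexing into an empty matcher, which would throw
    m.find() ? m.group(3) : null
}

// prefix in the form described in the comments above (assumed)
println extractSampleId('lane5651_AAGAGGCA_00h_Cell_WT3_L008')   // prints 00h_Cell_WT3
// a name that does not follow the naming scheme
println extractSampleId('unexpected_name.fastq.gz')              // prints null
```

Because `.+` is greedy, the third group runs up to the last `_L` followed by three digits, so sample IDs that themselves contain underscores are captured whole.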

For single-end experiments, the following snippet can be used:

#!/usr/bin/env nextflow

// fastq files are stored in reads as single ends R1
params.reads = 'reads/lane*_*_*_*_R1.fastq.gz'

Channel
     .fromPath(params.reads)
     .map { sample -> tuple(getLibraryId(sample), sample) }
     .set { samples_ch }

def getLibraryId( file ) {
  // dots are escaped so that they match a literal '.' rather than any character
  def regexp = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(.+)_(L[0-9]{3})_(R[1234])(\.fastq\.gz)/
  (file.name =~ regexp)[0][3]
}

process printNames {
    input:
    set sampleId, file from samples_ch

    output:
    stdout result

    """
    echo $sampleId 'and' $file
    """
}

result.subscribe { println it }
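
The two helpers differ only in whether the `_R1.fastq.gz` suffix is still present, so one function could probably serve both cases. Here is a sketch of that idea in plain Groovy, using named groups and an optional suffix. The function name parseReadName and the group names sample and lane are hypothetical. As before, it assumes the paired-end prefix ends at the lane token.

```groovy
// Hypothetical unified helper: accepts either the full file name
// or the prefix without the read suffix.
def parseReadName(String name) {
    def pattern = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(?<sample>.+)_(?<lane>L[0-9]{3})(_R[1-4]\.fastq\.gz)?/
    def m = (name =~ pattern)
    // matches() requires the whole string to fit the pattern
    if (!m.matches()) {
        return null
    }
    return [sample: m.group('sample'), lane: m.group('lane')]
}

def full = parseReadName('lane5651_AAGAGGCA_00h_Cell_WT3_L008_R1.fastq.gz')
println full.sample   // prints 00h_Cell_WT3
println full.lane     // prints L008
```

Inside a Nextflow script this would be called as parseReadName(prefix)?.sample for pairs or parseReadName(sample.name)?.sample for single files; the safe-navigation operator keeps a non-matching name from raising a null pointer error.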