Using RegEx to capture file name in Groovy/NextFlow
My sequencing files are named according to the following pattern: lane5651_AAGAGGCA_00h_Cell_WT3_L008_R1.fastq.gz. I would like to capture the sample ID, 00h_Cell_WT3, in order to name all downstream files accordingly. To this end, I wrote the following snippet:
#!/usr/bin/env nextflow
// fastq files are stored in reads as paired ends R1 and R4
params.reads = 'reads/lane*_*_*_*_R{1,4}.fastq.gz'

Channel
    .fromFilePairs(params.reads, flat: true)
    .map { prefix, file1, file2 -> tuple(getSampleID(prefix), file1, file2) }
    .set { samples_ch }

def getSampleID( file ) {
    // Use a regex to extract the sample ID.
    // For paired-end reads, fromFilePairs (with flat: true) returns a triplet
    // whose first item is the file name without `R{1,4}.fastq.gz`,
    // so the regex has to be adjusted as follows
    def regexpPE = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(.+)_(L[0-9]{3})/
    // [0] is the first match; [3] is its third capture group (the sample ID)
    (file =~ regexpPE)[0][3]
}

process printNames {
    input:
    set sampleId, forward_reads, reverse_reads from samples_ch

    output:
    stdout result

    """
    echo $sampleId 'and' $forward_reads 'and' $reverse_reads
    """
}

result.subscribe { println it }
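
To check the pattern outside of Nextflow, the same regular expression can be run in a plain Groovy script. This is only a minimal sketch: the `prefix` string is an assumption about what `fromFilePairs` passes to `getSampleID` for the example file, and the variable names are made up for illustration.

```groovy
// Hypothetical prefix: the example file name without `_R1.fastq.gz` (assumption)
def prefix = 'lane5651_AAGAGGCA_00h_Cell_WT3_L008'

// Same pattern as regexpPE: lane ID, 8-letter barcode, sample ID, lane number
def regexpPE = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(.+)_(L[0-9]{3})/

def matcher = (prefix =~ regexpPE)

// find() returns false when the name does not follow the pattern,
// so a bad file name fails here instead of when the matcher is indexed
assert matcher.find()

// group(3) is the third capture group, i.e. the sample ID
def sampleId = matcher.group(3)
println sampleId   // prints 00h_Cell_WT3
```

Because `(.+)` is greedy, the sample ID may itself contain underscores; the match backtracks only far enough to leave `_L008` for the last group.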
For single-end experiments, the following snippet can be used:
#!/usr/bin/env nextflow
// fastq files are stored in reads as single ends R1
params.reads = 'reads/lane*_*_*_*_R1.fastq.gz'

Channel
    .fromPath(params.reads)
    .map { sample -> tuple(getLibraryId(sample), sample) }
    .set { samples_ch }

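def getLibraryId( file ) {
    // The dots in the extension are escaped so that they match a literal '.'
    def regexp = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(.+)_(L[0-9]{3})_(R[1234])(\.fastq\.gz)/
    // Match against the file name only; group 3 is the sample ID
    (file.name =~ regexp)[0][3]
}
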
process printNames {
    input:
    set sampleId, file from samples_ch

    output:
    stdout result

    """
    echo $sampleId 'and' $file
    """
}

result.subscribe { println it }
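
The single-end pattern can be tried the same way on a full path. The sketch below is plain Groovy, not Nextflow, so it uses `fileName.toString()` on a `java.nio.file.Path` where the snippet above uses `file.name`. The helper `sampleIdOrNull` and both paths are hypothetical, and the dots of the extension are escaped here so that they match a literal `.`.

```groovy
import java.nio.file.Paths

// Hypothetical path, built from the example file name
def fastq = Paths.get('reads/lane5651_AAGAGGCA_00h_Cell_WT3_L008_R1.fastq.gz')

// Pattern from getLibraryId, with the extension dots escaped
def regexp = /([a-z]{4}[0-9]{4})_([A-Z]{8})_(.+)_(L[0-9]{3})_(R[1234])(\.fastq\.gz)/

// Hypothetical helper: returns the sample ID, or null when the name does not match
def sampleIdOrNull = { path ->
    def m = (path.fileName.toString() =~ regexp)
    m.find() ? m.group(3) : null
}

println sampleIdOrNull(fastq)                        // prints 00h_Cell_WT3
println sampleIdOrNull(Paths.get('reads/notes.txt')) // prints null
```

Returning null for a non-matching name is a design choice for this sketch; indexing the matcher directly, as in `(file.name =~ regexp)[0][3]`, throws an exception when nothing matches.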