Data Processing
Main Workflow
File: main.nf
The main workflow handles channel processing and parallel execution. It automatically detects the input data type (Illumina, Nanopore, or pre-annotated VCF) and routes each input to its dedicated sub-workflow. All inputs are processed concurrently.
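The routing idea can be pictured with a small Python sketch. How main.nf actually detects the input type is not described here, so the file-name rule below (VCF by extension, two FASTQ files treated as paired-end Illumina, one FASTQ file treated as Nanopore) and the function name `detect_input_type` are assumptions for illustration only.

```python
from pathlib import Path

FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
VCF_SUFFIXES = (".vcf", ".vcf.gz")


def detect_input_type(files):
    """Guess the sub-workflow for one sample from its file names (hypothetical rule)."""
    names = [Path(f).name.lower() for f in files]
    if not names:
        raise ValueError("no input files given")
    if all(n.endswith(VCF_SUFFIXES) for n in names):
        return "vcf"
    if all(n.endswith(FASTQ_SUFFIXES) for n in names):
        if len(names) == 2:
            return "illumina"  # assumed: two FASTQ files = paired-end reads
        if len(names) == 1:
            return "nanopore"  # assumed: one FASTQ file = long reads
    raise ValueError(f"cannot determine input type for {files}")
```

A real pipeline might instead rely on a sample sheet column or a command-line parameter; the sketch only shows the dispatch step.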
Nanopore (Long-Read) Workflow
File: nanopore.nf
For Oxford Nanopore Technologies (ONT) sequencing data
Quality Control:
Tool: FastQC
Metrics: Per-sample quality, GC content, per-base sequence quality, and N-content.
Trimming:
Tool: Chopper
Function: Filters reads based on average quality and minimum length.
Parameters: min_q = 10, min_l = 500.
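As a rough illustration of what a quality-and-length read filter does, the Python sketch below applies the two thresholds to a single read. It is not Chopper's code: the helper names are hypothetical, and averaging on the error-probability scale (rather than averaging raw Phred scores) is an assumption about how the mean quality is computed.

```python
import math

MIN_Q = 10   # min_q from the parameters above
MIN_L = 500  # min_l from the parameters above


def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality, averaged over error probabilities (assumed method)."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence, quality_string, min_q=MIN_Q, min_l=MIN_L):
    """Return True if the read is long enough and its mean quality is high enough."""
    if len(sequence) < min_l:
        return False
    return mean_read_quality(quality_string) >= min_q
```

The length check runs first, so very short reads are rejected before any quality arithmetic is done.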
Alignment:
Tool: Minimap2
Reference: M. tuberculosis H37Rv (NC_000962.3).
Variant Calling:
Tool: Medaka
Model: r941_e81_sup_variant_g514
Filtering:
Region Filter: Excludes repetitive regions (PE/PPE genes).
Type Filter: SNPs and Indels only.
Depth Filter: Minimum coverage (DP) ≥ 5x.
Quality Filter: Genotype Quality (GQ) ≥ 20.
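The four filters can be read as one predicate applied to each variant. The Python sketch below assumes a simple dictionary layout for a variant and a list of BED-style (0-based, half-open) excluded intervals; both are hypothetical, and the pipeline itself may implement the filters differently. Treating only plain nucleotide alleles as SNPs or indels is also an assumption.

```python
def passes_filters(variant, excluded_regions, min_dp=5, min_gq=20):
    """variant: dict with chrom, pos (1-based), ref, alt, dp, gq (hypothetical layout)."""
    # Region filter: drop variants inside any excluded interval.
    pos0 = variant["pos"] - 1
    for chrom, start, end in excluded_regions:
        if chrom == variant["chrom"] and start <= pos0 < end:
            return False

    # Type filter: keep SNPs (1 base to 1 base) and indels (lengths differ).
    ref, alt = variant["ref"], variant["alt"]
    if not set(ref + alt) <= set("ACGT"):
        return False  # symbolic or ambiguous alleles are rejected
    is_snp = len(ref) == 1 and len(alt) == 1
    is_indel = len(ref) != len(alt)
    if not (is_snp or is_indel):
        return False

    # Depth and genotype-quality filters.
    return variant["dp"] >= min_dp and variant["gq"] >= min_gq
```

The same predicate applies unchanged wherever these four filters are listed, since the thresholds are identical.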
Illumina (Short-Read) Workflow
File: illumina.nf
For Illumina paired-end sequencing data
Quality Control:
Tool: FastQC
Metrics: Per-sample quality, GC content, per-base sequence quality, and N-content.
Trimming:
Tool: Trimmomatic
Function: Quality trimming.
Settings: Leading and trailing quality cutoff of 3, sliding window of 4 bases with an average quality cutoff of 20 (4:20), and minimum length of 36 bp.
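To make the settings concrete, here is a simplified Python sketch of the four trimming steps applied to a list of per-base Phred scores. It is an approximation for illustration, not Trimmomatic's implementation: the function name is hypothetical, and the sliding-window step simply cuts at the start of the first window whose mean quality falls below the cutoff.

```python
def trim_read(quals, leading=3, trailing=3, window=4, window_q=20, min_len=36):
    """Return (start, end) of the kept slice, or None if the read is dropped."""
    start, end = 0, len(quals)

    # Leading: remove low-quality bases from the 5' end.
    while start < end and quals[start] < leading:
        start += 1
    # Trailing: remove low-quality bases from the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # Sliding window: scan 5' to 3' and cut at the first failing window.
    i = start
    while i + window <= end:
        if sum(quals[i:i + window]) / window < window_q:
            end = i
            break
        i += 1

    # Minimum length: drop reads that became too short.
    if end - start < min_len:
        return None
    return start, end
```

The steps run in the order the settings are listed, so the length check sees the read only after all quality trimming has been applied.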
Alignment:
Tool: BWA-MEM2
Reference: M. tuberculosis H37Rv (NC_000962.3).
Variant Calling:
Tool: GATK HaplotypeCaller
Filtering:
Region Filter: Excludes repetitive regions (PE/PPE genes).
Type Filter: SNPs and Indels only.
Depth Filter: Minimum coverage (DP) ≥ 5x.
Quality Filter: Genotype Quality (GQ) ≥ 20.
VCF Workflow
File: vcf.nf
For pre-annotated/raw variant files
Normalization:
Tool: bcftools norm
Filtering:
Region Filter: Excludes repetitive regions (PE/PPE genes).
Type Filter: SNPs and Indels only.
Depth Filter: Minimum coverage (DP) ≥ 5x.
Quality Filter: Genotype Quality (GQ) ≥ 20.
Variant Annotation
Tool: bcftools
Variants are matched with data from the WHO TB mutation database to assign drug resistance.
Annotated Fields:
GENE: The gene affected by the variant.
DRUG: Antibiotics associated with resistance.
EFFECT: Predicted molecular effect (e.g., missense, frameshift).
WHO_CLASSIFICATION: Confidence level of resistance association (e.g., “Assoc w R”).
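Conceptually, the annotation step is a lookup keyed on the variant's position and alleles. The Python sketch below stands in for what the pipeline does with bcftools; the dictionary layout and function name are hypothetical, and the single catalogue entry is an illustrative example that should be checked against the actual catalogue file rather than taken as pipeline data.

```python
# Illustrative catalogue: one entry keyed by (chrom, pos, ref, alt).
# The entry is an example only; verify it against the real catalogue.
CATALOGUE = {
    ("NC_000962.3", 761155, "C", "T"): {
        "GENE": "rpoB",
        "DRUG": "rifampicin",
        "EFFECT": "missense",
        "WHO_CLASSIFICATION": "Assoc w R",
    },
}


def annotate(variant, catalogue=CATALOGUE):
    """Return a copy of the variant with catalogue fields added when a match exists."""
    key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
    return {**variant, **catalogue.get(key, {})}
```

Variants without a catalogue match are returned unchanged, so downstream steps can tell annotated from unannotated records by the presence of the four fields.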
Lineage Classification
Determines the M. tuberculosis lineage based on specific SNP barcodes.
SNP Extraction:
Extracts variants from the VCF that overlap with known lineage markers defined in the BED file.
Classification Algorithm:
Scoring: Calculates the percentage of matching SNPs for each lineage.
Confidence Thresholds:
High: Score ≥ 80% and at least 3 matched SNPs.
Medium: Score ≥ 60% and at least 2 matched SNPs.
Low: Neither set of criteria is met.
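The scoring and thresholds can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: markers and sample SNPs are represented as sets of (position, allele) tuples, the score is taken as matched markers divided by the number of markers defined for that lineage, and the best-scoring lineage is reported. The function names are hypothetical.

```python
def confidence(score, matched):
    """Map a percentage score and a match count to a confidence label."""
    if score >= 80 and matched >= 3:
        return "High"
    if score >= 60 and matched >= 2:
        return "Medium"
    return "Low"


def classify_lineage(sample_snps, lineage_markers):
    """sample_snps: set of (pos, allele); lineage_markers: dict of lineage -> set."""
    best = {"lineage": None, "score": 0.0, "matched": 0}
    for lineage, markers in lineage_markers.items():
        if not markers:
            continue  # a lineage without markers cannot be scored
        matched = len(sample_snps & markers)
        score = 100.0 * matched / len(markers)
        if score > best["score"]:
            best = {"lineage": lineage, "score": score, "matched": matched}
    best["confidence"] = confidence(best["score"], best["matched"])
    return best
```

For example, a sample matching 4 of 5 markers for one lineage scores 80% with 4 matches and is labeled High, while 3 of 5 scores 60% and is labeled Medium.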
FHIR Converter
Converts annotated variant calling data into HL7 FHIR R4 standard resources.
Input Parsing: Reads annotated VCFs and Lineage JSON results.
Mapping:
Drugs: Mapped to SNOMED CT codes.
Variants: Mapped to HGVS nomenclature.
Observations: Uses LOINC codes.
Resource Creation:
Generates Variant Observation, Drug Susceptibility Observation, and Lineage Observation resources and embeds WHO classification resistance data.
Generates a DiagnosticReport resource for the conclusion from all variants (e.g., MDR-TB, XDR-TB).
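For orientation, a variant Observation might be assembled roughly as in the Python sketch below. The converter's actual resource layout and codes are not listed in this document, so everything here is an assumption: the function name is hypothetical, and the LOINC codes are the ones commonly used for genomic observations (variant assessment, gene studied, c.HGVS change) and should be verified before use.

```python
import json

LOINC = "http://loinc.org"


def variant_observation(patient_id, gene, hgvs, who_classification):
    """Build a minimal FHIR R4 Observation dict for one variant (illustrative layout)."""
    return {
        "resourceType": "Observation",
        "status": "final",
        # Assumed code: LOINC 69548-6, genetic variant assessment.
        "code": {"coding": [{"system": LOINC, "code": "69548-6"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "component": [
            {   # Assumed code: LOINC 48018-6, gene studied.
                "code": {"coding": [{"system": LOINC, "code": "48018-6"}]},
                "valueCodeableConcept": {"text": gene},
            },
            {   # Assumed code: LOINC 48004-6, DNA change in c.HGVS notation.
                "code": {"coding": [{"system": LOINC, "code": "48004-6"}]},
                "valueCodeableConcept": {"text": hgvs},
            },
            {   # WHO classification carried as free text in this sketch.
                "code": {"text": "WHO classification"},
                "valueString": who_classification,
            },
        ],
    }


def to_json(resource):
    """Serialize a resource dict to a JSON string."""
    return json.dumps(resource, indent=2)
```

Keeping the resource as a plain dictionary makes it easy to serialize with `to_json` and to collect several resources into a bundle later.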
Upload to FHIR Server
File: upload_fhir.nf
Uploads the FHIR Genomics bundle together with clinical metadata. Before running it, obtain a bearer token using scripts/get_access_token.py and fill in the clinical metadata in each metadata CSV (patient, organization, and practitioner).
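The shape of an authenticated upload can be sketched with the Python standard library. This is not the code in upload_fhir.nf: the function name and server URL are hypothetical, the token is assumed to be available as a plain string, and the request is only built, not sent.

```python
import json
import urllib.request


def build_upload_request(base_url, bundle, token):
    """Build (but do not send) a POST of a FHIR bundle with a bearer token."""
    return urllib.request.Request(
        url=base_url,
        data=json.dumps(bundle).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/fhir+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the upload; a server that rejects the token typically answers with HTTP 401.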
Workflow Parameters
nextflow.config defines all input files, directories, versioning, and specific tool parameters, relative to the base directory ($baseDir).