Data Processing

Main Workflow

File: main.nf The main workflow handles channel processing and parallel execution. It automatically detects input data types (Illumina, Nanopore, or pre-annotated VCF) and routes them to the dedicated sub-workflows. All inputs are processed concurrently.

Nanopore (Long-Read) Workflow

File: nanopore.nf For Oxford Nanopore Technologies (ONT) sequencing data

  1. Quality Control:

    • Tool: FastQC

    • Metrics: Per-sample quality, GC content, per-base sequence quality, and N-content.

  2. Trimming

    • Tool: Chopper

    • Function: Filters reads based on average quality and minimum length.

    • Parameters: min_q = 10, min_l = 500.

  3. Alignment:

    • Tool: Minimap2

    • Reference: M. tuberculosis H37Rv (NC_000962.3).

  4. Variant Calling:

    • Tool: Medaka

    • Model: r941_e81_sup_variant_g514

  5. Filtering:

    • Region Filter: Excludes repetitive regions (PE/PPE genes).

    • Type Filter: SNPs and Indels only.

    • Depth Filter: Minimum coverage (DP) ≥ 5x.

    • Quality Filter: Genotype Quality (GQ) ≥ 20.

Illumina (Short-Read) Workflow

File: illumina.nf For Illumina paired-end sequencing data

  1. Quality Control:

    • Tool: FastQC

    • Metrics: Per-sample quality, GC content, per-base sequence quality, and N-content.

  2. Trimming

    • Tool: Trimmomatic

    • Function: Quality trimming.

    • Settings: Leading/Trailing quality cutoff (3), Sliding Window quality cutoff (4:20), and minimum length (36 bp).

  3. Alignment:

    • Tool: BWA-MEM2

    • Reference: M. tuberculosis H37Rv (NC_000962.3).

  4. Variant Calling:

    • Tool: GATK HaplotypeCaller

  5. Filtering:

    • Region Filter: Excludes repetitive regions (PE/PPE genes).

    • Type Filter: SNPs and Indels only.

    • Depth Filter: Minimum coverage (DP) ≥ 5x.

    • Quality Filter: Genotype Quality (GQ) ≥ 20.

VCF Workflow

File: vcf.nf For pre-annotated/raw variant files

  1. Normalization:

    • Tool: bcftools norm

  2. Filtering:

    • Region Filter: Excludes repetitive regions (PE/PPE genes).

    • Type Filter: SNPs and Indels only.

    • Depth Filter: Minimum coverage (DP) ≥ 5x.

    • Quality Filter: Genotype Quality (GQ) ≥ 20.

Variant Annotation

Tool: bcftools Variants are matched with data from the WHO TB mutation database to assign drug resistance. Annotated Fields:

  • GENE: The gene affected by the variant.

  • DRUG: Antibiotics associated with resistance.

  • EFFECT: Predicted molecular effect (e.g., missense, frameshift).

  • WHO_CLASSIFICATION: Confidence level of resistance association (e.g., “Assoc w R”).

Lineage Classification

Determines the M. tuberculosis lineage based on specific SNP barcodes.

  1. SNP Extraction:

    • Extracts variants from the VCF that overlap with known lineage markers defined in the BED file.

  2. Classification Algorithm:

    • Scoring: Calculates the percentage of matching SNPs for each lineage.

    • Confidence Thresholds:

      • High: Score ≥ 80% matching SNPs and matched >= 3.

      • Medium: Score ≥ 60% matching SNPs and matched >= 2.

      • Low: Does not meet criteria.

FHIR Converter

Converts annotated variant calling data into HL7 FHIR R4 standard resources.

  1. Input Parsing: Reads annotated VCFs and Lineage JSON results.

  2. Mapping:

    • Drugs: Mapped to SNOMED CT codes.

    • Variants: Mapped to HGVS nomenclature.

    • Observations: Uses LOINC codes.

  3. Resource Creation:

    • Generates Variant Observation, Drug Susceptibility Observation, and Lineage Observation resources and embeds WHO classification resistance data.

    • Generates DiagnosticReport resource for the conclusion from all variants (e.g., MDR-TB, XDR-TB).

Upload to FHIR Server

File: upload_fhir.nf For uploading FHIR Genomics bundle with clinical metadata. Must grant bearer token first using scripts/get_access_token.py and fill the clinical metadata on each metadata csv (patient, organization, and practitioner).

Workflow Parameter

nextflow.config defines all input files, directories, versioning, and specific tool parameters, relative to the base directory ($baseDir).