Rapid and Reproducible RSSC Sequevar Typing with Snakemake

Automated sequevar typing for Ralstonia solanacearum from genome to assignment in one command


Background: Why Sequevar Typing Matters

Ralstonia solanacearum species complex (RSSC) is one of the most destructive soilborne plant pathogens worldwide — responsible for bacterial wilt on hundreds of host species including potato, tomato, banana, and ginger. Accurate strain characterisation is critical for:

  • Epidemiological tracking across geographic regions and timepoints
  • Quarantine and regulatory decisions at borders and in diagnostics labs
  • Host range and virulence prediction based on known sequevar biology

The gold standard for RSSC subspecies classification uses two complementary systems:

  • Phylotypes (I–IV + IIA/IIB) — based on geographic/genomic lineage
  • Sequevars — fine-grained groups within phylotypes, defined by sequence diversity in the endoglucanase (egl) gene

Traditionally, sequevar assignment involved manual PCR, Sanger sequencing, multiple alignment, tree construction, and visual inspection — a process that does not scale to modern genomic datasets.

The Solution: RSSC_sequevar_typing Pipeline

The RSSC_sequevar_typing Snakemake pipeline fully automates this workflow. Starting from bacterial genome assemblies (.fna), it extracts the egl gene, performs phylogenetic analysis against 58 curated reference sequevars, and delivers both an annotated tree and a tabular sequevar assignment — all in a single reproducible run.

conda activate snakemake
snakemake --use-conda -j 8

Pipeline Overview

The workflow consists of 7 rules that take genome FASTA files to a sequevar assignment table:

data/*.fna
     │
     ▼
 ┌──────────────┐
 │ 1. run_epcr  │  In-silico PCR — extract egl amplicons from all genomes
 └──────┬───────┘
        ▼
 ┌──────────────────┐
 │ 2. extract_egl   │  Split batch ePCR output into per-sample FASTA files
 └──────┬───────────┘
        ▼
 ┌──────────────┐
 │ 3. trim_egl  │  Trim to ~698 nt phylogenetically informative region
 └──────┬───────┘
        ▼
 ┌──────────────┐
 │ 4. align_egl │  MAFFT alignment (queries + 58 reference sequevars)
 └──────┬───────┘
        ▼
 ┌──────────────┐
 │ 5. build_tree│  IQ-TREE2 ML tree (TIM+I+G4, 1000 UFBoot2 replicates)
 └──────┬───────┘
        │
        ├─────────────────────────────────────┐
        ▼                                     ▼
 ┌────────────────────┐   ┌──────────────────────────────┐
 │ 6. visualize_tree  │   │ 7. assign_sequevar            │
 │ ggtree PDF with    │   │ Nearest-reference tip lookup  │
 │ sequevar annotation│   │ → sequevar_assignments.txt    │
 └────────────────────┘   └──────────────────────────────┘

Steps 2–3 run in parallel across all samples. Steps 6 and 7 also run in parallel once the tree is built.


Step-by-Step Breakdown

Step 1 — In-silico PCR (run_epcr)

The egl gene is extracted from each genome assembly using primer-based in-silico PCR via the companion run_epcr.py script.

Parameter Value
Forward primer ATGCATGCCGCTGGTCGCCGC
Reverse primer GCGTTGCCCGGCACGAACACC
Max amplicon length 2000 bp
Max mismatches 2
Max indels 0

Output: results/00_epcr/all_amplicons.fasta

Step 2 — Extract per-sample egl (extract_egl_gene)

The batch ePCR FASTA is split into individual per-sample files by matching the sample name in the amplicon header. Runs in parallel across all samples.

Step 3 — Trim to phylogenetic region (trim_egl)

The raw amplicon (~900–1100 bp) is trimmed to the ~698 nt phylogenetically informative region using motif-based trimming. This region is the standard used in all sequevar classification literature.

Motifs Values
START patterns ACCGCG, ACCGTG, ACGGCG, ACGGTG, ACTGCG, ATGGTG, CGGGCG
END pattern CAGTGG
Expected size 678–710 nt

When multiple START motifs match, disambiguation uses upstream markers (AGCCT, ACCCT, AGTTT) found within 15 nt before the candidate.

Step 4 — MAFFT alignment (align_egl)

Trimmed query sequences are aligned together with the 58 curated reference sequevar sequences from the CIRAD dataverse using MAFFT (--auto --reorder).

Step 5 — Maximum-likelihood tree (build_tree)

A phylogenetic tree is inferred with IQ-TREE2 using the TIM+I+G4 substitution model — selected by prior jModelTest2 analysis of the egl reference sequences — with 1000 ultrafast bootstrap (UFBoot2) replicates.

Parameter Value
Tool IQ-TREE2 ≥ 2.2
Model TIM+I+G4
Bootstrap 1000 UFBoot2

Step 6 — Annotated tree visualization (visualize_tree)

An annotated, publication-ready tree is rendered as a PDF using ggtree in R. Tip labels are colour-coded by phylotype:

Phylotype Colour
I Red
IIA Blue
IIB Green
III Purple
IV Orange
Query Black

Bootstrap support values ≥ 70% are shown on internal nodes.

Step 7 — Sequevar assignment (assign_sequevar)

Each query genome is assigned a phylotype and sequevar by finding the closest reference tip in the ML tree using patristic (branch-length) distances. The output is a tab-delimited table:

Column Description
Genome Query genome name (.fna file stem)
Phylotype Assigned RSSC phylotype (I, IIA, IIB, III, IV)
Sequevar Assigned egl sequevar number
Closest_Reference FASTA ID of the nearest reference tip
Tree_Distance Patristic distance to the closest reference

Note: A Tree_Distance > 0.05 may indicate a divergent or novel genotype. Always inspect the tree visualization alongside the assignment table.


Output Files

Output Path
ePCR amplicons (batch) results/00_epcr/all_amplicons.fasta
Per-sample egl genes results/01_egl_genes/{sample}.egl.fasta
Trimmed egl sequences (~698 nt) results/02_egl_genes_trimmed/{sample}.egl.trimmed.fasta
Multiple sequence alignment results/03_alignment/egl_aligned.fasta
ML tree (Newick) results/04_trees/IQTree2.egl.tree
Annotated tree visualization tree_visualizations/IQTree2.egl.tree.visualization.pdf
Sequevar assignments results/05_sequevar_assignments/sequevar_assignments.txt

Quick Start

1. Clone and place genomes

git clone https://gitlab.ilvo.be/stevebaeyen/rssc_sequevar_typing.git
cd rssc_sequevar_typing
cp /path/to/your/genomes/*.fna data/

2. Run the pipeline

conda activate snakemake
snakemake --use-conda -j 8

3. Preview without running

snakemake -n -p

4. Run on a SLURM cluster

snakemake --use-conda -j 100 \
    --cluster "sbatch -p batch -c {threads} --mem=4G -t 01:00:00" \
    --configfile config.yaml

Configuration

All parameters are centralised in config.yaml:

data_dir: "data"

epcr:
  max_length: 2000
  mismatches: 2
  indels: 0

trimming:
  start_patterns: [ACCGCG, ACCGTG, ACGGCG, ACGGTG, ACTGCG, ATGGTG, CGGGCG]
  end_pattern: CAGTGG

iqtree2:
  model: "TIM+I+G4"
  bootstrap: 1000

resources:
  epcr_threads: 32
  mafft_threads: 16
  iqtree_threads: 32

Requirements

All tool dependencies (MAFFT, IQ-TREE2, ggtree, Biopython, etc.) are installed automatically via per-rule conda environments. No manual dependency management required.


Citation

If you use this pipeline in your research, please cite:

Baeyen, S. (2026). Snakemake pipeline for Ralstonia solanacearum species complex (RSSC) sequevar typing based on the endoglucanase (egl) gene (Version 1.0.0). Zenodo. https://doi.org/10.5281/zenodo.19002750


References

  • Cellier, G., Pecrix, Y., Gauche, M. M., & Cheron, J. J. (2023). Ralstonia solanacearum species complex egl reference database. CIRAD Dataverse. https://doi.org/10.18167/DVN1/CUWA5P
  • Fegan, M. & Prior, P. (2005). How complex is the Ralstonia solanacearum species complex? In: Bacterial Wilt Disease and the Ralstonia solanacearum Species Complex, APS Press.
  • Wicker, E. et al. (2007). Ralstonia solanacearum strains from Martinique exhibiting a new pathogenic potential. Applied and Environmental Microbiology 73(21):6790–6801.
  • Nguyen, L.-T. et al. (2015). IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32(1):268–274.
  • Yu, G. et al. (2017). ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 8(1):28–36.

Pipeline repository: https://gitlab.ilvo.be/stevebaeyen/rssc_sequevar_typing
License: MIT
Author: Steve Baeyen — ILVO Plant Sciences Unit




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Superior Bacterial Genome Assembly with Autocycler Consensus and Snakemake
  • Adventures in Bacterial Genome Assembly with Snakemake
  • Introducing the qPCR Primer Analyzer ILVO
  • Batch In-Silico (q)PCR Made Easy with run_epcr.py
  • From Bash to Copilot: My Bioinformatics Terminal Workflow