BiG-SCAPE Guide: Network Analysis of Biosynthetic Gene Clusters for Drug Discovery

Paisley Howard, Jan 09, 2026

Abstract

This comprehensive guide provides researchers, scientists, and drug development professionals with a deep dive into BiG-SCAPE, a widely used tool for grouping Biosynthetic Gene Clusters (BGCs) into families. We begin by establishing the foundational concepts of BGC diversity and the critical role of genome mining in natural product discovery. The article then details the methodological workflow of BiG-SCAPE, from installation and input preparation to interpreting its sequence similarity network outputs. Practical guidance is offered for troubleshooting common computational issues and optimizing parameters for specific research goals. We position BiG-SCAPE against related tools such as antiSMASH and PRISM, which predict BGCs and their products rather than network them, highlighting its niche in generating BGC family networks. The conclusion synthesizes how BiG-SCAPE accelerates the prioritization of novel bioactive compounds, directly impacting modern biomedical and clinical research pipelines.

What is BiG-SCAPE? Understanding BGC Families and the Need for Comparative Genomics

Biosynthetic Gene Clusters (BGCs) are sets of co-localized genes that collectively encode the molecular machinery for producing a specialized metabolite, or natural product (NP). In a BiG-SCAPE analysis, BGCs are the fundamental genomic units whose diversity, evolution, and network relationships are investigated to prioritize novel chemistry for drug discovery. BGCs typically encode enzymes for core scaffold assembly (e.g., polyketide synthases, non-ribosomal peptide synthetases), together with genes for tailoring, regulation, and transport.

Table 1: Quantitative Overview of Major BGC Classes and Their Products

| BGC Class | Core Biosynthetic Enzyme(s) | Approx. % of Known Microbial NPs* | Example Drug(s) |
|---|---|---|---|
| Non-Ribosomal Peptide (NRPS) | NRPS | ~35% | Vancomycin, Penicillin |
| Type I Polyketide (T1PKS) | Type I PKS | ~25% | Erythromycin, Rapamycin |
| Terpene | Terpene Synthases/Cyclases | ~15% | Artemisinin, Taxol (both plant-derived) |
| Ribosomally synthesized and post-translationally modified peptides (RiPPs) | Precursor Peptide & Modification Enzymes | ~10% | Nisin, Thiostrepton |
| Hybrid (e.g., NRPS-PKS) | Mixed NRPS/PKS | ~15% | Bleomycin |

*Representative distribution based on curated databases like MIBiG. Percentages are approximate and vary by organismal source.

Protocols for BGC Discovery and Analysis in a BiG-SCAPE Workflow

Protocol 2.1: Genome Mining and BGC Prediction

This protocol details the initial identification of BGCs from genomic data, a prerequisite for BiG-SCAPE family analysis.

Materials (Research Reagent Solutions):

  • antiSMASH: A computational toolkit for the automated identification and annotation of BGCs in genomic data. It is the primary detection engine.
  • Genome Assembly: High-quality microbial genome assembly in FASTA format. Input quality dictates prediction accuracy.
  • MIBiG Database: Reference database of known BGCs. Used for comparative analysis and initial functional annotation.
  • Python Environment: Required for running antiSMASH and downstream processing scripts.

Methodology:

  • Input Preparation: Ensure your genomic assembly is in FASTA format. Check for contamination and completeness using tools like CheckM.
  • antiSMASH Execution: Run antiSMASH locally or via the web server. Use the --genefinding-tool prodigal option for standard bacterial genomes.

  • Output Parsing: antiSMASH generates a directory containing a JSON file (*.json) with structured BGC predictions, a GenBank file with annotations, and an HTML overview. Extract the GenBank files of each predicted BGC region for downstream steps.
  • Data Curation: Manually review the predicted BGC boundaries and core biosynthetic types. This step is critical for generating a high-quality input dataset for BiG-SCAPE.
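
The output-parsing step above can be sketched in Python. This is a minimal illustration, not part of antiSMASH or BiG-SCAPE: the function name `collect_region_gbks` and the directory layout (one antiSMASH output folder per genome under a common root) are assumptions. It relies only on the per-region GenBank files carrying "region" in their file names.

```python
from pathlib import Path
import shutil


def collect_region_gbks(antismash_root, dest_dir):
    """Copy every per-region GenBank file found under antismash_root into dest_dir.

    Each copy is prefixed with its parent folder name so that regions from
    different genomes (which often share contig names) cannot overwrite each other.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materialises the file list before any copying starts
    for gbk in sorted(Path(antismash_root).rglob("*region*.gbk")):
        target = dest / f"{gbk.parent.name}__{gbk.name}"
        shutil.copy2(gbk, target)
        copied.append(target)
    return copied
```

Keeping the destination directory outside the antiSMASH root avoids picking up earlier copies on a second run.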

Table 2: Key antiSMASH Parameters for BGC Detection

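| Parameter | Recommended Setting | Purpose |
|---|---|---|
| --genefinding-tool | prodigal (bacteria) | Predicts protein-coding genes when the input lacks gene annotations. |
| --cb-knownclusters | Enabled | Compares predicted clusters against known clusters in MIBiG. |
| --cb-subclusters | Enabled | Compares predicted clusters against known subclusters (e.g., precursor biosynthesis). |
| --clusterhmmer | Enabled | Runs a Pfam HMMer annotation limited to the predicted cluster regions. |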

Protocol 2.2: Generating Gene Cluster Families (GCFs) with BiG-SCAPE

This protocol uses BiG-SCAPE to group the BGCs predicted in Protocol 2.1 into families based on pairwise similarity.

Materials (Research Reagent Solutions):

  • BiG-SCAPE Software: Python-based tool for calculating pairwise distances between BGCs and generating sequence similarity networks.
  • Pfam Database: File of HMM profiles for protein domain annotation (Pfam-A.hmm). Used for domain-based similarity calculation.
  • Input BGCs: Collection of GenBank files from Protocol 2.1 (or public databases).
  • MAFFT: Multiple sequence alignment tool, required for the optional alignment mode.

Methodology:

  • Input Preparation: Place all BGC GenBank files in a single directory (e.g., ./my_bgcs/).
  • Run BiG-SCAPE with default settings, supplying the input directory, an output directory, and the directory that holds the Pfam database.

  • Network Analysis: BiG-SCAPE outputs network files (.network), SVG figures of the clusters, and an interactive HTML summary. Gene Cluster Families (GCFs) are groups of BGCs whose pairwise distances fall below the chosen cutoff and that cluster together in the network. Start by opening index.html in the output folder.
  • Interpretation: BGCs within the same GCF are hypothesized to produce similar or structurally related natural products. These GCFs become targets for prioritization (e.g., those in poorly studied branches of the network).
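
The run step above can be sketched as a small Python helper that assembles the command line without executing it. The flag names `--inputdir`, `--outputdir`, `--pfam_dir`, and `-c` are those listed under "Parameters Explained" later in this guide. The script name `bigscape.py`, the `--cutoffs` flag, and all paths are assumptions that should be checked against the installed version's help output.

```python
import shlex


def build_bigscape_command(input_dir, output_dir, pfam_dir, cores=8,
                           cutoffs=None, script="bigscape.py"):
    """Return a BiG-SCAPE invocation as an argument list (nothing is executed)."""
    cmd = [
        "python", script,             # assumed entry point of a BiG-SCAPE 1.x checkout
        "--inputdir", str(input_dir),
        "--outputdir", str(output_dir),
        "--pfam_dir", str(pfam_dir),
        "-c", str(cores),
    ]
    if cutoffs:
        # Assumed flag: one or more distance cutoffs, space-separated
        cmd.append("--cutoffs")
        cmd.extend(str(c) for c in cutoffs)
    return cmd


# Hypothetical paths, for illustration only
cmd = build_bigscape_command("./my_bgcs", "./bigscape_results", "./pfam", cutoffs=[0.3])
print(shlex.join(cmd))
```

Building the argument list first makes it easy to log the exact command next to the results before handing it to `subprocess.run`.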

Figure: BiG-SCAPE Workflow for BGC Family Analysis. Input genomic data (FASTA/GenBank) -> BGC prediction (antiSMASH) -> BGC GenBank files -> similarity network analysis (BiG-SCAPE) -> Gene Cluster Family (GCF) network -> GCF prioritization for novel chemistry.

Protocol 2.3: Targeted PCR Verification of a Predicted BGC

Following in silico prediction and prioritization, this protocol validates the physical presence of a BGC in a microbial strain.

Materials (Research Reagent Solutions):

  • Bacterial Genomic DNA: High-molecular-weight DNA from the candidate producer strain.
  • Specific PCR Primers: Designed to amplify a key, diagnostic region of the predicted BGC (e.g., a portion of a core biosynthetic gene).
  • High-Fidelity DNA Polymerase: Enzyme for accurate amplification of long or GC-rich regions common in BGCs.
  • Gel Extraction Kit: For purifying the amplified PCR product for sequencing.
  • Sanger Sequencing Service: To confirm the DNA sequence matches the in silico prediction.

Methodology:

  • Primer Design: Using the antiSMASH-predicted BGC sequence, design primers (~20-25 bp) with a Tm of ~60°C to amplify a 1-2 kb internal fragment.
  • PCR Setup: Set up a 25 µL reaction with high-fidelity polymerase, 50-100 ng genomic DNA, and 0.5 µM each primer.
  • Thermocycling:
    • 98°C for 30 s (initial denaturation)
    • 35 cycles of: 98°C for 10 s, 60°C for 20 s, 72°C for 1 min/kb
    • 72°C for 5 min (final extension)
  • Analysis: Run the PCR product on a 1% agarose gel. A single band of the expected size provides initial validation. Purify the band and submit for Sanger sequencing. Align the returned sequence to the predicted BGC.
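
As a rough screen for the primer-design and thermocycling steps above, the following sketch computes GC content, a basic melting temperature estimate, and the extension time for a given amplicon. The Tm formula is the simple GC-count approximation for primers longer than 13 nt; it is an assumption of this sketch and usually reads lower than the nearest-neighbor calculators recommended by polymerase suppliers, so treat it as a first filter only.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a primer sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def basic_tm(seq):
    """Basic Tm estimate in degrees C for primers longer than 13 nt (GC-count formula)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def extension_seconds(amplicon_bp, seconds_per_kb=60):
    """Extension time for one cycle, using the protocol's 1 min/kb as the default."""
    return amplicon_bp / 1000 * seconds_per_kb


primer = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer, illustration only
print(gc_fraction(primer), round(basic_tm(primer), 2), extension_seconds(1500))
```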

Figure: PCR Verification Protocol for Predicted BGCs. Genomic DNA (template), specific primers, and PCR master mix (high-fidelity polymerase) -> PCR amplification -> agarose gel electrophoresis -> product purification and Sanger sequencing -> sequence alignment and BGC validation.

Table 3: Key Research Reagent Solutions for BGC Analysis

| Item | Category | Function in BGC Research |
|---|---|---|
| antiSMASH | Software | Core genome mining tool for BGC prediction and annotation. |
| BiG-SCAPE & CORASON | Software | Analyzes BGC sequence similarity, creates networks, and infers phylogeny. |
| MIBiG Database | Database | Reference repository for experimentally characterized BGCs. Essential for training and comparison. |
| Pfam Database | Database | Library of protein domain HMMs. Used by antiSMASH and BiG-SCAPE for functional annotation. |
| High-Fidelity PCR Kit | Wet-lab Reagent | Amplifies specific regions of BGCs from genomic DNA for validation. |
| Heterologous Expression Host (e.g., S. albus, A. nidulans) | Biological System | Chassis for expressing silent or complex BGCs to discover their products. |
| LC-MS/MS Instrumentation | Analytical Equipment | Critical for detecting and characterizing the natural products synthesized by BGCs. |

Application Notes

Genomic Context and BiG-SCAPE Analysis

The identification and classification of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs) is central to modern natural product discovery. BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) addresses the challenge of BGC diversity by enabling large-scale, sequence-based networking and phylogenomic analysis.

Key Quantitative Findings (Representative Data from Recent Studies):

Table 1: BiG-SCAPE Analysis Output Statistics (Example Dataset: 10,000 BGCs from MIBiG & Genomes)

| Metric | Value | Description |
|---|---|---|
| Input BGCs | 10,000 | Number of BGCs analyzed (predicted + known) |
| Network Families (GCFs) | 1,850 | Clusters of related BGCs (cutoff: 0.3 distance) |
| Singleton BGCs | 420 | BGCs not grouped into any GCF |
| Largest GCF Size | 287 | Number of BGCs in the largest family (often NRPS/PKS) |
| Average GCF Size | 5.2 | Mean number of BGCs per family |
| Core Biosynthetic Classes | 8 | Major classes identified (e.g., NRPS, T1PKS, RiPPs, Terpene) |

Table 2: Correlation Between GCF Size and Discovery Potential

| GCF Category | Avg. BGCs per GCF | Likelihood of Novel Chemistry* | Prioritization for Heterologous Expression |
|---|---|---|---|
| Large GCFs (>50 BGCs) | 112 | Low to Medium (well-explored) | Lower; focus on underrepresented strains |
| Medium GCFs (5-50 BGCs) | 18 | Medium to High | High; balanced diversity & homology |
| Small GCFs (2-4 BGCs) | 3.2 | High | Very High; likely unique or variant chemistry |
| Singletons | 1 | Very High (but risk of false positives) | Case-by-case; requires validation |

*Based on the diversity of precursor and modification enzymes within the GCF.
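
The size categories in the table can be applied programmatically once family sizes are known. The sketch below is illustrative only. It treats a family of exactly 5 BGCs as Medium, which is an assumption made here so that the Small and Medium ranges do not overlap. The function names are hypothetical.

```python
from collections import Counter


def gcf_size_category(n_bgcs):
    """Map a family size to the categories used in the table above."""
    if n_bgcs < 1:
        raise ValueError("a family must contain at least one BGC")
    if n_bgcs == 1:
        return "Singleton"
    if n_bgcs < 5:
        return "Small"
    if n_bgcs <= 50:
        return "Medium"
    return "Large"


def summarize_categories(family_sizes):
    """Count how many families fall into each size category."""
    return dict(Counter(gcf_size_category(n) for n in family_sizes))


print(summarize_categories([1, 3, 3, 18, 112]))
```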

Strategic Implications for Drug Discovery

Grouping BGCs into GCFs allows researchers to prioritize targets. Large GCFs may represent widely conserved metabolites with known bioactivity, while smaller GCFs and singletons are hotspots for novel scaffold discovery. BiG-SCAPE's network visualization facilitates the identification of "neighborhoods" of BGCs that share hybrid architectures or unusual combinations of biosynthetic modules.

Detailed Protocols

Protocol: BiG-SCAPE Workflow for GCF Analysis

Aim: To generate and analyze Gene Cluster Families from a set of predicted or known BGCs.

Materials & Software:

  • Input Data: antiSMASH (v6.0+) GenBank output files for BGCs.
  • BiG-SCAPE: Version 1.1.5 or higher (https://git.wageningenur.nl/medema-group/BiG-SCAPE).
  • Prerequisites: Python 3.7+, hmmer, mafft, FastTree, DIAMOND.
  • Computing: Linux cluster or high-performance workstation (≥16 GB RAM recommended for large runs).

Procedure:

  • Data Preparation:

    • Collect all GenBank files (.gbk, .gbff) from antiSMASH runs into a single directory (e.g., my_bgcs/).
    • Ensure files are correctly named and only contain one BGC per file.
  • Running BiG-SCAPE Core Analysis:

    • Execute the main script from the BiG-SCAPE installation directory.

    • Parameters Explained:

      • -c 8: Number of CPU cores to use.
      • --pfam_dir: Path to directory containing Pfam database files.
      • --inputdir: Directory containing input GenBank files.
      • --outputdir: Desired output directory.
  • Generating Networks with Custom Cutoff:

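    • To generate sequence similarity networks at a specific cutoff, pass the value explicitly (the --cutoffs option in BiG-SCAPE 1.x). Cutoffs are distances: lower values (e.g., 0.3) give stricter, smaller families, while higher values (e.g., 0.7) give more permissive, larger ones.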

  • Output Interpretation:

    • Results are found in ./bigscape_results.
    • Network Files: .network files (for use in Cytoscape).
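    • HTML Visualization: Open the index.html written to the output directory (in BiG-SCAPE 1.x it sits at the top level of ./bigscape_results) to explore GCFs interactively in a web browser.
    • Tabular Data: Within network_files/, each class folder (e.g., mix/, present when the mixed analysis is enabled) contains .network files with pairwise similarity scores and .tsv files with family assignments.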
  • Downstream Analysis (Guideline):

    • Prioritize GCFs: Focus on GCFs of medium size containing BGCs from underexplored taxonomic branches.
    • Extract Core Biosynthetic Genes: Use the fasta files generated for each GCF to perform multiple sequence alignment and build phylogenetic trees of key enzymes (e.g., KS, C domains).
    • Correlate with Metabolomics: Map metabolomic data (e.g., MS/MS molecular networking from GNPS) to GCFs if strain extracts are available.
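
To explore a .network file outside Cytoscape, the sketch below reads a tab-separated edge list and groups BGCs into connected components at a chosen distance cutoff. The column layout (BGC name 1, BGC name 2, raw distance as the first three columns, with one header row) is an assumption to verify against your own files. Connected components are only an approximation: BiG-SCAPE's own GCF assignments come from its clustering step and are reported in the clustering .tsv files.

```python
import csv
from collections import defaultdict


def components_from_network(lines, cutoff=0.3):
    """Group BGCs into connected components using edges with distance <= cutoff.

    `lines` is any iterable of tab-separated text lines (e.g., an open file).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue
        a, b, dist = row[0], row[1], float(row[2])
        find(a)
        find(b)  # register both nodes even if the edge is too distant
        if dist <= cutoff:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    # Largest components first, ties broken alphabetically
    return sorted((sorted(m) for m in groups.values()), key=lambda m: (-len(m), m))
```

Usage with a real file would look like `with open(path) as fh: components_from_network(fh, cutoff=0.3)`, where `path` points to one of the .network files.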

Protocol: Targeted Genome Mining Guided by GCF

Aim: To discover novel variants of a specific BGC class (e.g., lasso peptides) across a bacterial genus.

  • Database Construction: Download all genome assemblies for target genus from NCBI.
  • BGC Prediction: Run antiSMASH on all genomes with relaxed detection strictness for the target class.
  • GCF Construction: Run BiG-SCAPE on the resulting BGCs alongside known reference BGCs from MIBiG.
  • Family Identification: Identify the GCF containing your reference BGCs. Examine its composition.
  • Strain Selection: Select producer strains harboring BGCs that are phylogenetically distinct within the GCF but share the core architecture.
  • Heterologous Expression: Capture the entire candidate BGC (e.g., by TAR cloning or Cas9-assisted capture; direct PCR amplification is practical only for small clusters) and express it in a suitable host (e.g., S. albus).
  • Compound Isolation: Culture the expression host, extract metabolites, and use HPLC-MS guided by the predicted core structure mass.

Visualizations

Figure: BiG-SCAPE Analysis Workflow from Genomes to GCFs. Input microbial genomes -> BGC prediction (antiSMASH) -> BGC collection (.gbk files) -> BiG-SCAPE core (Pfam domain analysis, pairwise distance) -> similarity network generation -> Gene Cluster Families (GCFs) identified -> visualization and analysis (HTML, Cytoscape) -> target prioritization for expression.

Figure: Decision Logic for Prioritizing BGCs Based on GCF Analysis. Is the candidate BGC assigned to a GCF? If no (singleton): high priority, proceed with expression. If yes: is the GCF small or medium? If no (large GCF): low priority, likely a known metabolite. If yes: is the taxonomic origin rare or uncultivated? If no: medium priority, genomic context analysis. If yes: does the BGC contain a novel domain architecture? If yes: high priority; if no: medium priority.
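
The decision logic in this figure translates directly into a small function. This is a sketch of the flow as drawn. The function and argument names are hypothetical, and each yes/no input still has to come from your own analysis.

```python
def prioritize_bgc(in_gcf, gcf_is_small_or_medium=False,
                   rare_taxon=False, novel_architecture=False):
    """Return 'HIGH', 'MEDIUM' or 'LOW' following the decision flow above."""
    if not in_gcf:
        return "HIGH"        # singleton: proceed with expression
    if not gcf_is_small_or_medium:
        return "LOW"         # large GCF: likely a known metabolite
    if not rare_taxon:
        return "MEDIUM"      # analyze genomic context first
    return "HIGH" if novel_architecture else "MEDIUM"


print(prioritize_bgc(in_gcf=True, gcf_is_small_or_medium=True,
                     rare_taxon=True, novel_architecture=True))
```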

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions for BGC Mining & GCF Analysis

| Item | Function & Application | Key Consideration |
|---|---|---|
| antiSMASH | Rule-based detection & annotation of BGCs in genomic data. | Versions >6.0 include TIGRFam and ClusterBlast improvements. |
| Pfam & MIBiG Databases | Pfam: HMM profiles for domain detection. MIBiG: Repository of known BGCs for reference. | Must be locally installed and updated for BiG-SCAPE. |
| BiG-SCAPE / CORASON | BiG-SCAPE: Creates sequence similarity networks. CORASON: Phylogenetic-focused analysis of specific BGC types. | BiG-SCAPE for broad surveys; CORASON for deep dives into a class. |
| Cytoscape | Open-source platform for visualizing and exploring similarity networks (.network files). | Use enhancedGraphics plugin for custom node annotations. |
| Heterologous Host (e.g., S. albus J1074) | Clean genetic background host for BGC expression and compound production. | Must optimize culture and transformation protocols for the host. |
| Gibson or REDIRECT Cloning Kits | Assembly of large, often >50 kb, BGC fragments for heterologous expression. | Fidelity and efficiency for large insert assembly are critical. |
| LC-HRMS/MS System | High-resolution metabolomics to detect novel compounds from expression hosts. | Couple with GNPS for molecular networking to link chemistry to GCFs. |

Core Algorithm and Quantitative Framework

BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) is a computational tool for the global analysis of Biosynthetic Gene Clusters (BGCs). It performs an all-vs-all comparison of BGCs encoded in standard GenBank files, calculates pairwise distances, and groups the BGCs into Gene Cluster Families (GCFs); BiG-SCAPE 1.x does this with affinity propagation clustering. Its output supports studies of the phylogeny and evolution of secondary metabolism.

Table 1: BiG-SCAPE Core Algorithm Parameters and Outputs

| Component | Parameter/Output | Description & Quantitative Metric |
|---|---|---|
| Input | GenBank files | Requires antiSMASH 5+ annotated BGCs (.gbk). Handles diverse types (PKS, NRPS, RiPPs, etc.). |
| Distance Calculation | Jaccard Index (domain-based) | Measures similarity based on shared Pfam domains: J(A,B) = ∣A∩B∣ / ∣A∪B∣. Ranges from 0 (no shared domains) to 1 (identical domain sets). |
| | Adjacency Index | Measures the share of adjacent domain pairs two BGCs have in common, so differences in domain order lower the similarity. |
| | Domain Sequence Similarity (DSS) | Sequence identity between aligned copies of shared domains. |
| | Combined Distance | distance = 1 - (w_J * Jaccard + w_AI * Adjacency Index + w_DSS * DSS), with weights summing to 1. Weights are class-specific; the Jaccard weight for the mixed class is 0.2. |
| Clustering | Affinity propagation | Applied to the BGC pairs that pass the cut-off in order to define GCFs. |
| | Cut-off Value | Defines GCF boundaries. Default cut-off: 0.3 (distance). Lower = stricter, fewer BGCs per GCF. |
| Network Analysis | Nodes | Represent individual BGCs. Node size can be scaled by BGC length or number of domains. |
| | Edges | Connect BGCs with a pairwise distance below the selected cut-off (e.g., 0.3). Edge weight = 1 - distance. |
| Output | Gene Cluster Families (GCFs) | Groups of BGCs putatively encoding the synthesis of structurally related compounds. A run on 10,000 BGCs typically yields 1,500-2,500 GCFs. |
| | Network files (.network) | Visualized in Cytoscape. GCFs appear as dense subgraphs. |
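
A minimal sketch of how domain-based indices combine into a distance. The domain labels, the precomputed DSS value of 0.5, and the weight tuple are illustrative assumptions. BiG-SCAPE's own implementation also includes a domain sequence similarity (DSS) term computed from alignments, anchor-domain boosting, and class-specific weights, none of which are reproduced here.

```python
def jaccard_index(domains_a, domains_b):
    """Shared distinct domains divided by all distinct domains."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def adjacency_index(domains_a, domains_b):
    """Jaccard index computed over unordered pairs of neighbouring domains."""
    def pairs(domains):
        domains = list(domains)
        return {frozenset(p) for p in zip(domains, domains[1:])}

    pa, pb = pairs(domains_a), pairs(domains_b)
    if not pa and not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


def combined_distance(indices, weights):
    """distance = 1 - weighted sum of similarity indices (weights must sum to 1)."""
    if len(indices) != len(weights):
        raise ValueError("need one weight per index")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 1.0 - sum(w * s for w, s in zip(weights, indices))


# Hypothetical domain strings for two PKS-like clusters
bgc_a = ["KS", "AT", "KR", "ACP"]
bgc_b = ["KS", "AT", "ACP", "TE"]
ji = jaccard_index(bgc_a, bgc_b)
ai = adjacency_index(bgc_a, bgc_b)
# Weights given in the order (Jaccard, adjacency, DSS); values are illustrative
print(ji, ai, combined_distance([ji, ai, 0.5], [0.2, 0.05, 0.75]))
```

Passing the indices and weights as parallel sequences keeps the function usable for a two-term or a three-term weighting scheme.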

Principles of Phylogenetic Network Construction

The BiG-SCAPE network is a sequence similarity network, not a phylogenetic tree. It can therefore represent evolutionary relationships that a tree cannot, including horizontal gene transfer, recombination, and module shuffling in modular BGCs (e.g., PKSs).

Key Principles:

  • Network as Connected Components: The full network breaks into disconnected subgraphs (connected components), each holding one GCF or a few closely related GCFs.
  • Edge Weight as Evolutionary Signal: Thicker, shorter edges indicate higher similarity (lower distance), suggesting closer evolutionary relationships.
  • Core-Periphery Structure: Within a GCF subgraph, a dense "core" of highly similar BGCs is surrounded by a "periphery" of more divergent relatives, illustrating the family's radiation.
  • Hybrid Nodes/Bridges: BGCs that connect two distinct GCFs may represent evolutionary intermediates, ancestral states, or hybrids, highlighting the reticulate evolution of biosynthetic pathways.

Figure: BiG-SCAPE Phylogenetic Network Structure. Two Gene Cluster Families are shown. In Family 1, core BGCs A, B, and C are fully interconnected, with a peripheral BGC attached to A and a divergent BGC attached to B. In Family 2, core BGCs D and E are linked, with a peripheral BGC attached to D. A hybrid/bridge BGC connects the divergent BGC of Family 1 to the peripheral BGC of Family 2.

Application Notes & Protocols

Protocol 1: Standard BiG-SCAPE Run for GCF Analysis

Objective: Generate Gene Cluster Families from a set of BGCs.

Input: Directory containing antiSMASH GenBank files (.gbk).

  • Installation: conda create -n bigscape -c bioconda bigscape
  • Activate Environment: conda activate bigscape
  • Run Core Algorithm:

  • Output Interpretation:
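    • Navigate to output/network_files/. The *.network files are tab-separated edge lists that can be imported into Cytoscape or Gephi for visualization.
    • The mix folder holds the analysis of all BGCs together regardless of class; the Others folder holds BGCs not assigned to a major class.
    • The HTML output (opened via index.html) contains summary pages showing GCF statistics.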

Protocol 2: Phylogenetic Network Visualization and Interpretation in Cytoscape

Objective: Visualize and analyze the BiG-SCAPE network.

  • Import: In Cytoscape, import the *.network file (File > Import > Network from File) and assign the two BGC name columns as source and target nodes.
  • Apply Layout: Use a force-directed layout (e.g., "Prefuse Force Directed") to cluster similar nodes.
  • Style Nodes:
    • Color: Map BGC type (e.g., PKS, NRPS) to node fill color.
    • Size: Map BGC length (number of genes/domains) to node size.
  • Style Edges:
    • Width: Map weight (similarity) to edge width. Higher weight = thicker line.
    • Color: Use a gradient (e.g., gray to black) for weight or a single muted color.
  • Analysis: Use Cytoscape's built-in tools to find highly connected nodes (hub BGCs), identify dense clusters (GCF cores), and analyze network topology (Tools > NetworkAnalyzer).

Table 2: Key Reagents & Resources for BiG-SCAPE-Driven Research

| Item | Function in BiG-SCAPE Context |
|---|---|
| antiSMASH Database | Source of Input Data. Provides pre-computed BGC annotations in GenBank format for thousands of microbial genomes, enabling large-scale BiG-SCAPE analysis. |
| Pfam Database | Core Algorithm Dependency. Provides the Hidden Markov Model (HMM) profiles used by BiG-SCAPE to identify and compare protein domains within BGCs. Essential for distance calculation. |
| Cytoscape | Network Visualization & Analysis. The primary platform for visualizing BiG-SCAPE's .network output. Enables interactive exploration of GCFs, calculation of network statistics, and preparation of publication-quality figures. |
| MIBiG (Minimum Information about a BGC) | Reference & Validation Database. A curated repository of experimentally characterized BGCs. Used to annotate and "ground truth" GCFs predicted by BiG-SCAPE by identifying known clusters within networks. |
| CORASON | Complementary Phylogenetic Tool. Often used downstream of BiG-SCAPE. Constructs detailed phylogenetic trees of specific GCFs based on core biosynthetic proteins, providing higher-resolution evolutionary insights. |
| Python & SciPy Stack | Execution & Customization Environment. BiG-SCAPE is written in Python. Custom analysis of output tables (*.tsv) benefits from libraries like pandas, NumPy, and Matplotlib for further data mining and figure generation. |

Figure: BiG-SCAPE Analysis Workflow. Microbial genomes -> antiSMASH annotation -> BGC GenBank files (.gbk) -> BiG-SCAPE core algorithm (using domain profiles from the Pfam HMM database) -> pairwise distance matrix -> clustering -> network files and Gene Cluster Families (GCFs) -> Cytoscape visualization, with GCFs used to annotate the network.

Within the broader thesis on genome mining for novel therapeutics, BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) analysis is a cornerstone methodology. It addresses the critical challenge of navigating the vast genomic landscape to efficiently prioritize biosynthetic gene clusters (BGCs) for experimental characterization. Researchers in drug discovery rely on it to systematically classify, compare, and dereplicate BGCs encoding secondary metabolites, such as antibiotics, antifungals, and anticancer agents, thereby accelerating the identification of novel chemical scaffolds.

Application Notes: Core Use Cases in Drug Discovery

1. BGC Dereplication & Novelty Assessment

Prior to costly and time-consuming heterologous expression or cultivation, researchers use BiG-SCAPE to determine if a newly sequenced BGC is novel or a known variant. By comparing query BGCs against databases (e.g., MIBiG), it clusters them into Gene Cluster Families (GCFs), flagging rediscovered clusters before any wet-lab work begins.

2. Targeted Isolation of Analogs & Derivatives

When a potent compound with undesirable properties (e.g., toxicity, poor solubility) is discovered, BiG-SCAPE identifies other BGCs within the same GCF that likely produce structural analogs. This enables a targeted search for derivatives with improved pharmacological profiles.

3. Ecological & Taxonomic Mapping of Chemical Space

BiG-SCAPE analysis across diverse microbial genomes links chemical potential (GCFs) to taxonomy and ecology. This allows hypothesis-driven discovery, such as focusing on specific bacterial phyla from unique environments for antibiotic discovery.

Table 1: Quantitative Impact of BiG-SCAPE Analysis in Representative Studies

| Study Focus | # of BGCs Analyzed | # of GCFs Identified | Key Outcome | Reference Context |
|---|---|---|---|---|
| Marine Actinomycetes | 1,200+ | ~150 | Reduced target BGCs for characterization by >90% via dereplication. | [Recent Metagenomic Study] |
| Streptomyces Pan-genome | 5,800 | 320 | Identified 15 unique GCFs associated with a specific clade, guiding isolation. | [Comparative Genomics Paper] |
| Fungal Genome Mining | 850 | 45 | Discovered 3 new GCFs; one led to a novel antifungal scaffold. | [Mycological Research Journal] |

Experimental Protocols

Protocol 1: Standard BiG-SCAPE Workflow for BGC Prioritization

Objective: To cluster BGCs from genomic or metagenomic assemblies and prioritize novel GCFs for downstream drug discovery pipelines.

Materials & Input:

  • Input Data: GenBank files of predicted BGCs (e.g., from antiSMASH analysis).
  • Software: BiG-SCAPE (installed via Conda/docker) and CORASON (for phylogeny).
  • Hardware: Multi-core Linux server recommended for large datasets.

Procedure:

  • BGC Prediction: Generate GenBank files for all BGCs of interest using antiSMASH v7+. Retain the *region*.gbk files.
  • BiG-SCAPE Run: Execute the core clustering command:

  • Output Analysis:
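    • Navigate to output_dir/network_files/. Within the run's mix folder, the .network file for the 0.30 cutoff lists pairwise distances, and the accompanying clustering .tsv file for the same cutoff contains the GCF assignments.
    • Cross-reference GCFs with the MIBiG reference database (available when the run is started with the MIBiG option) to flag known clusters.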
    • Identify GCFs containing no MIBiG reference members as high-priority novel targets.
  • CORASON Analysis (Optional but Recommended): For high-interest GCFs, run CORASON to generate multi-locus phylogenetic trees of the BGCs in a family, confirming homology and guiding primer design for cloning.
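
The cross-referencing step above can be sketched as follows. The input is assumed to be a two-column, tab-separated clustering table (BGC name, family ID) with comment lines starting with "#". MIBiG reference entries are recognized by their "BGC" accession prefix (e.g., BGC0000001), so query files whose names also start with "BGC" would be miscounted. The function name is hypothetical.

```python
import csv
from collections import defaultdict


def classify_families(lines, reference_prefix="BGC"):
    """Label each family as 'known only', 'mixed' or 'all unknown'."""
    members = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip header/comment lines and malformed rows
        members[row[1]].append(row[0])

    labels = {}
    for family, bgcs in members.items():
        n_ref = sum(1 for name in bgcs if name.startswith(reference_prefix))
        if n_ref == 0:
            labels[family] = "all unknown"   # candidate novel target
        elif n_ref == len(bgcs):
            labels[family] = "known only"    # dereplicate
        else:
            labels[family] = "mixed"         # possible source of analogs
    return labels
```

Families labeled "all unknown" correspond to the high-priority novel targets described in the procedure.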

Protocol 2: Integrating BiG-SCAPE with Metabolomics Data

Objective: To correlate GCFs with analytical chemistry data (e.g., LC-MS) to identify the metabolite produced by an orphan GCF.

Procedure:

  • Perform BiG-SCAPE analysis on BGCs from a strain library (e.g., 100 actinomycete isolates).
  • For each strain, generate an ethyl acetate extract and analyze via high-resolution LC-MS.
  • Use tools like GNPS (Global Natural Products Social Molecular Networking) to create molecular networks from MS/MS data.
  • Overlay GCF data: MS/MS features detected only in strains that share a specific GCF should group into the same sub-network of the molecular network. Features unique to those strains are strong candidates for the metabolite produced by the GCF.
  • Target isolation efforts on these candidate ions.
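
The overlay step can be prototyped with simple presence/absence logic before moving to statistical scoring. In this sketch, `strain_gcfs` and `strain_features` are hypothetical dictionaries mapping strain names to the GCFs they carry and to the MS features detected in their extracts. Real data will need tolerance for silent clusters and detection limits, which this strict filter ignores.

```python
def candidate_features(strain_gcfs, strain_features, gcf_id):
    """Return features seen in every strain carrying gcf_id and in no other strain."""
    carriers = {s for s, gcfs in strain_gcfs.items() if gcf_id in gcfs}
    if not carriers:
        return set()
    # Features shared by all carrier strains
    shared = set.intersection(*(set(strain_features.get(s, ())) for s in carriers))
    # Features observed in any strain lacking the GCF
    others = set(strain_features) - carriers
    seen_elsewhere = set().union(*(set(strain_features[s]) for s in others))
    return shared - seen_elsewhere


# Toy example with made-up strain names and feature labels
strain_gcfs = {"S1": {"GCF_7", "GCF_2"}, "S2": {"GCF_7"}, "S3": {"GCF_2"}}
strain_features = {
    "S1": {"m/z 512.3", "m/z 301.1", "m/z 220.0"},
    "S2": {"m/z 512.3", "m/z 220.0"},
    "S3": {"m/z 301.1", "m/z 220.0"},
}
print(candidate_features(strain_gcfs, strain_features, "GCF_7"))
```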

Diagrams

Figure: BiG-SCAPE Analysis & Prioritization Workflow. Genomes -> antiSMASH -> BiG-SCAPE -> GCFs (clustering) -> prioritization (analysis). Novel targets proceed to experimental characterization; known clusters are dereplicated to save resources.

Figure: Decision Tree for Prioritizing GCFs. GCF A (known: vancomycin) -> dereplicate (ignore/low priority). GCF B (1 known + 10 unknown members) -> target (isolate analogs). GCF C (all unknown) -> clone (highest priority, novel scaffold).

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for BiG-SCAPE-Centric Research

| Item | Function in BiG-SCAPE/Downstream Workflow |
|---|---|
| antiSMASH | Core tool for initial BGC prediction from genome sequences; generates the essential input files for BiG-SCAPE. |
| Pfam Database | Required by BiG-SCAPE for domain annotation and sequence comparison. Must be downloaded separately. |
| MIBiG Reference Dataset | Can be included in a BiG-SCAPE run; the gold-standard repository for known BGCs, enabling dereplication against characterized clusters. |
| Conda/Bioconda | Recommended package manager for seamless installation of BiG-SCAPE and all its complex dependencies (e.g., HMMER, DIAMOND). |
| GNPS Platform | Web-based mass spectrometry ecosystem for correlating GCFs (genotype) with chemical data (phenotype) via molecular networking. |
| PCR Reagents & Kits | For amplifying and cloning prioritized orphan BGCs based on CORASON-guided primer design for heterologous expression. |
| Expression Hosts (e.g., S. albus) | Engineered bacterial chassis for expressing cloned BGCs from unculturable or slow-growing source organisms. |

How to Run BiG-SCAPE: A Step-by-Step Workflow from Input to Network Visualization

This protocol details the setup of the Python computational environment required for the analysis of Biosynthetic Gene Cluster (BGC) families using BiG-SCAPE within the broader thesis research. A reproducible and correctly configured environment is critical for the generation of accurate and comparable Gene Cluster Family (GCF) networks, which form the basis for downstream natural product discovery and drug development pipelines.

System Prerequisites & Quantitative Specifications

The following table summarizes the minimum and recommended hardware/software requirements based on current BiG-SCAPE and dependency documentation.

Table 1: System and Core Software Prerequisites

| Component | Minimum Specification | Recommended Specification | Purpose/Rationale |
|---|---|---|---|
| Operating System | Linux x86_64 / macOS 10.14+ | Linux distribution (Ubuntu 22.04 LTS) | Core dependencies (e.g., HMMER) are UNIX-oriented. |
| CPU | 4 cores | 16+ cores | Parallel processing for all-vs-all BGC comparison. |
| RAM | 16 GB | 64+ GB | Handling large multiple sequence alignments and network data. |
| Storage (Free) | 50 GB | 500 GB+ SSD | For input GenBank files, intermediate Pfam data, and output networks. |
| Python | Version 3.8 | Version 3.9 or 3.10 | Core runtime for BiG-SCAPE and auxiliary scripts. |
| Java Runtime | JRE 11 | JRE 17 | Not needed by BiG-SCAPE itself; required by Java-based tools such as Cytoscape. |

Core Dependency Installation Protocol

Step-by-Step System-Level Dependency Installation

This protocol ensures all non-Python binaries required by BiG-SCAPE are available.

Protocol 1: Installing Core Bioinformatics Tools via Conda

  • Install Miniconda by downloading the latest Linux 64-bit bash installer.

  • Create and activate a new Conda environment named bigscape.

  • Install the essential bioinformatics packages via Bioconda channels.

  • Verify installations:
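
One way to verify the installations from inside Python is to check that the expected executables are on the PATH. The tool names below (HMMER's hmmscan, hmmpress, and hmmalign, plus FastTree) are assumptions about what a BiG-SCAPE 1.x setup needs; adjust the list to match the documentation of your installed version.

```python
import shutil


def check_executables(names=("hmmscan", "hmmpress", "hmmalign", "FastTree")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


status = check_executables()
missing = [name for name, path in status.items() if path is None]
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All listed executables were found.")
```

Run this inside the activated Conda environment so that the PATH reflects the environment's binaries.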

Python Environment and Package Installation

This protocol installs BiG-SCAPE and its direct Python dependencies, ensuring version compatibility.

Protocol 2: Setting Up the Python Virtual Environment and BiG-SCAPE

  • Within the activated Conda environment (bigscape), upgrade core Python packages.

  • Install BiG-SCAPE and its Python dependencies directly from PyPI.

  • Install additional Python packages for data analysis and visualization within the thesis workflow.

  • Verify the installation by checking the BiG-SCAPE help menu.

Table 2: Key Python Dependencies and Their Functions

| Package | Version Range | Role in BiG-SCAPE Workflow |
| --- | --- | --- |
| BiG-SCAPE | ≥1.1.0 | Main orchestration script for BGC processing, comparison, and network generation. |
| NumPy | ≥1.19.0 | Efficient numerical computations for distance matrix calculations. |
| Biopython | ≥1.78 | Parsing GenBank files, handling sequence I/O. |
| Scikit-learn | ≥0.24.0 | Optional; used for advanced clustering analyses of GCFs. |
| Matplotlib | ≥3.3.0 | Generation of static GCF network visualizations. |
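The version ranges in Table 2 can be checked against the active environment. This is a rough sketch using only the standard library; the distribution names in `MINIMUM_VERSIONS` are assumptions (they are the usual names for these packages), and the version parser is deliberately simple rather than a full implementation of Python's version rules.

```python
from importlib import metadata

# Minimum versions copied from Table 2; distribution names are assumptions.
MINIMUM_VERSIONS = {
    "numpy": "1.19.0",
    "biopython": "1.78",
    "scikit-learn": "0.24.0",
    "matplotlib": "3.3.0",
}


def parse_version(text, width=4):
    """Turn '1.19.0' into (1, 19, 0, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad with zeros so '1.19' and '1.19.0' compare as equal.
    return tuple((parts + [0] * width)[:width])


def check_package(name, minimum):
    """Return (installed version or None, whether it meets the minimum)."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for name, minimum in MINIMUM_VERSIONS.items():
        installed, ok = check_package(name, minimum)
        print(f"{name:14s} need >= {minimum:8s} found {installed} ok={ok}")
```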

Workflow Visualization: From BGCs to Gene Cluster Families

[Diagram: Input GenBank files (containing BGCs) → 1. BGC detection and preprocessing (Prodigal, domain prediction) → 2. All-vs-all comparison (Pfam distance calculation) → 3. Distance matrix generation → 4. Network clustering (MCL algorithm) → 5. GCF network visualization (.network files, .svg) → Output: Gene Cluster Families (GCFs) and sequence files. Prerequisites and environment: Python 3.8+, HMMER suite, Conda environment, NumPy, Biopython.]

BiG-SCAPE Computational Workflow

The Scientist's Toolkit: Essential Research Reagent Solutions

Table 3: Key Computational Research Reagents

| Item/Category | Specific Solution/Software | Function in BGC Analysis |
| --- | --- | --- |
| BGC Prediction Engine | antiSMASH (v7.0+) | Primary tool for identifying and annotating BGCs in genomic data prior to BiG-SCAPE analysis. |
| Domain Database | Pfam database (v35.0+) | Provides hidden Markov models (HMMs) for protein domain identification, the basis for BGC similarity scoring. |
| Clustering Algorithm | MCL (Markov Clustering) | Integrated into BiG-SCAPE to partition the similarity network into discrete Gene Cluster Families. |
| Sequence Aligner | MUSCLE (v3.8+) | Aligns protein domain sequences for accurate distance calculation between BGCs. |
| Visualization Suite | Cytoscape (v3.9+) | Used for advanced, interactive exploration and customization of the generated GCF networks. |
| Runtime Environment | Conda Environment | Isolates and manages all software versions to guarantee reproducibility across research deployments. |

Thesis Context

This protocol is a foundational component of a thesis employing BiG-SCAPE for the large-scale analysis of Biosynthetic Gene Cluster (BGC) families. Consistent and correctly formatted input is critical for generating reliable distance matrices and constructing meaningful gene cluster family networks. This document details the preparation of the two primary input types for BiG-SCAPE: GenBank files and antiSMASH JSON results.

GenBank File Preparation

Protocol: Standardizing Annotated GenBank Files for BiG-SCAPE

Objective: To convert or curate GenBank (.gbk) files from genome sequencing projects into a format compatible with BiG-SCAPE analysis.

Materials & Reagents:

  • Source Genomes: Assembled and annotated bacterial, fungal, or other microbial genomes (FASTA, GFF, or proprietary formats).
  • Annotation Software: Prokka, Bakta, or NCBI’s PGAP for prokaryotes; Funannotate for eukaryotes.
  • Sequence Manipulation Tools: Biopython, EMBOSS suite, or SeqKit.
  • Validation Tool: BiG-SCAPE’s built-in checks (run with --help or --test).

Procedure:

  • Annotation Generation:

    • If starting from an assembled genome (FASTA) and annotation (GFF), use a tool like bcbio or a custom Biopython script to generate a standard GenBank file.
    • Example Command (using Prokka):

      The primary output genome_01.gbk is suitable for the next steps.

  • File Format Verification:

    • Ensure the file is in valid GenBank flat file format. The header should contain the LOCUS keyword, and features should be in the FEATURES table.
    • Validate using Biopython:

  • Locus Tag Standardization (Critical):

    • BiG-SCAPE uses the /locus_tag qualifier within CDS features to group and identify genes. Every coding sequence must have a unique locus_tag.
    • If absent, assign a systematic locus tag (e.g., STC_00001, STC_00002). A Biopython script can automate this.
  • BGC Delimitation (For pre-defined clusters):

    • If submitting a specific BGC rather than a whole genome, the GenBank file must contain only the region of the BGC.
    • Extract the BGC region using coordinates. Ensure all features within the region are preserved and the sequence is linear.
    • Update the LOCUS line to reflect the new length and assign a clear identifier (e.g., Streptomyces_coelicolor_ASM_ccoelicolor_cluster1).
  • Final Validation for BiG-SCAPE:

    • Place all curated .gbk files in a single directory (e.g., gbk_files/).
    • Run a preliminary BiG-SCAPE command to test for parsing errors:

    • Address any error messages regarding file format or missing qualifiers.
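The format and locus tag checks in the procedure above can be sketched without extra dependencies. The example below is a lightweight stand-in for a Biopython-based check, not a full GenBank parser: it assumes the standard flat-file layout (feature keys indented by 5 spaces, qualifiers on their own indented lines), and the sample record is made up for illustration.

```python
import re
from collections import Counter

FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")           # e.g. "     CDS    1..150"
LOCUS_TAG = re.compile(r'^\s+/locus_tag="([^"]+)"')


def check_genbank_text(text):
    """Return a list of problems found in GenBank text (empty if none)."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("LOCUS"):
        problems.append("first line does not start with LOCUS")
    if not any(line.startswith("FEATURES") for line in lines):
        problems.append("no FEATURES table")

    tags = []              # locus_tag (or None) for each CDS, in file order
    in_features = False
    in_cds = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if line.startswith("ORIGIN") or line.startswith("//"):
            in_features = False
            in_cds = False
        if not in_features:
            continue
        key = FEATURE_KEY.match(line)
        if key:
            in_cds = key.group(1) == "CDS"
            if in_cds:
                tags.append(None)
            continue
        tag = LOCUS_TAG.match(line)
        if in_cds and tag:
            tags[-1] = tag.group(1)

    missing = sum(1 for tag in tags if tag is None)
    if missing:
        problems.append(f"{missing} CDS feature(s) without /locus_tag")
    duplicates = [t for t, n in Counter(t for t in tags if t).items() if n > 1]
    if duplicates:
        problems.append("duplicate locus_tag(s): " + ", ".join(sorted(duplicates)))
    return problems


def _feature(key, location):
    return " " * 5 + key.ljust(16) + location


def _qualifier(text):
    return " " * 21 + text


# Made-up record: the second CDS deliberately lacks a locus_tag.
SAMPLE = "\n".join([
    "LOCUS       demo_cluster   300 bp    DNA     linear   BCT 01-JAN-2026",
    "FEATURES             Location/Qualifiers",
    _feature("CDS", "1..150"),
    _qualifier('/locus_tag="STC_00001"'),
    _qualifier('/product="polyketide synthase"'),
    _feature("CDS", "160..300"),
    _qualifier('/product="hypothetical protein"'),
    "ORIGIN",
    "//",
])

print(check_genbank_text(SAMPLE))  # ['1 CDS feature(s) without /locus_tag']
```

To check real files, read each `.gbk` file in the input directory as text and pass it to `check_genbank_text`; a file that passes this check can still be rejected by a stricter parser, so treat it as a first filter only.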

Table 1: Essential Qualifiers in GenBank CDS Features for BiG-SCAPE

| Qualifier | Presence Requirement | Description | Format Example |
| --- | --- | --- | --- |
| locus_tag | Mandatory | Unique identifier for the gene. | locus_tag="SCO0001" |
| translation | Recommended | Amino acid sequence of the CDS. | translation="MKLF..." |
| product | Recommended | Functional description. | product="polyketide synthase" |
| gene | Optional | Gene name. | gene="pksA" |

antiSMASH JSON Results Preparation

Protocol: Direct Integration of antiSMASH Output for Enhanced Analysis

Objective: To utilize the detailed, structured output from antiSMASH as direct input for BiG-SCAPE, enabling incorporation of substrate predictions and module-level information.

Materials & Reagents:

  • antiSMASH Software: Version 6.0 or higher (latest stable release recommended).
  • Input for antiSMASH: GenBank file(s) of whole microbial genomes or contigs.
  • Computational Resources: Sufficient CPU/Memory for antiSMASH run (cluster recommended for batches).

Procedure:

  • Run antiSMASH Analysis:

    • Analyze your GenBank file(s) using antiSMASH with the --json flag to generate the necessary JSON output.
    • Example Command:

    • This generates a .json file (e.g., genome_01.json) in the output directory alongside other files.

  • Locate and Consolidate JSON Files:

    • The primary JSON result file is located in the antiSMASH output directory for each input file.
    • For batch processing, copy all relevant .json files into a single, dedicated directory (e.g., json_files/).
    • Important: Do not modify the internal structure of the JSON files.
  • Validate JSON Integrity:

    • Ensure files are valid JSON and contain the expected antiSMASH data structure (e.g., records, areas, modules).
    • Quick validation using jq:

  • Input for BiG-SCAPE:

    • When running BiG-SCAPE, specify the directory containing the JSON files using the --json flag.
    • Example BiG-SCAPE Command with JSON Input:

    • BiG-SCAPE will parse the JSON files to extract BGC information, including potentially detailed domain architecture from the modules section.
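The JSON integrity check can also be done with Python's standard `json` module. This is a sketch under the assumption stated in the procedure above, namely that each file has a top-level `records` list whose entries may carry an `areas` list; the function names are hypothetical, and the key names should be confirmed against the files produced by your antiSMASH version.

```python
import json
from pathlib import Path


def summarize_antismash_json(path):
    """Load one JSON file and return (n_records, n_areas); raise ValueError if unusable."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except json.JSONDecodeError as err:
        raise ValueError(f"{path}: not valid JSON ({err})") from err
    # Assumption taken from the text above: a top-level "records" list
    # whose items may each carry an "areas" list.
    records = data.get("records") if isinstance(data, dict) else None
    if not isinstance(records, list):
        raise ValueError(f"{path}: no top-level 'records' list")
    n_areas = sum(len(rec.get("areas", [])) for rec in records if isinstance(rec, dict))
    return len(records), n_areas


def check_directory(json_dir):
    """Print a one-line summary per *.json file; return the names of bad files."""
    bad = []
    for path in sorted(Path(json_dir).glob("*.json")):
        try:
            n_records, n_areas = summarize_antismash_json(path)
            print(f"{path.name}: {n_records} record(s), {n_areas} area(s)")
        except ValueError as err:
            print(f"PROBLEM {err}")
            bad.append(path.name)
    return bad
```

Calling `check_directory("json_files/")` before a run gives a quick list of files that are truncated or lack the expected structure.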

Table 2: Comparison of BiG-SCAPE Input File Types

| Feature | GenBank (.gbk) Input | antiSMASH JSON (.json) Input |
| --- | --- | --- |
| Primary Use Case | Pre-defined BGCs; custom annotations; non-antiSMASH pipelines. | Direct integration of antiSMASH results. |
| Information Richness | Basic: Sequence, CDS locations, product names. | High: Includes substrate predictions (PKS/NRPS), TFBS, SMCOG profiles, module boundaries. |
| BGC Detection | Relies on user-defined boundaries in the file. | Uses antiSMASH-defined cluster boundaries. |
| Best For | Curated datasets, combining clusters from multiple detection tools. | Leveraging full predictive power of antiSMASH; homogeneity in analysis. |

Visualizations

Diagram 1: Input Preparation Workflow for BiG-SCAPE

[Diagram: Assembled genome (FASTA) plus annotation file (GFF3) → Annotation and GenBank creation → GenBank file (.gbk). From the GenBank file, two branches: (a) antiSMASH analysis → antiSMASH results (.json) → Final JSON files (directory); (b) Curate and format → Extract BGC region (optional) → Final GBK files (directory). Both directories feed the BiG-SCAPE analysis.]

Diagram 2: Data Flow from Sources to BiG-SCAPE Core

[Diagram: Public databases (NCBI, JGI), the sequencing core lab, and custom annotation pipelines all supply GenBank files. GenBank files go to BGC domain detection directly (Path A) or via antiSMASH processing as antiSMASH JSON files (Path B, enriched). BGC domain detection → Distance matrix calculation → Gene Cluster Family network.]

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Digital "Reagents" for Input Preparation

| Item | Category | Function in Protocol |
| --- | --- | --- |
| Prokka | Annotation Pipeline | Rapid prokaryotic genome annotation; generates compliant GenBank files from assemblies. |
| antiSMASH 6+ | BGC Detection Engine | Predicts BGCs, boundaries, and chemical features; outputs structured JSON for BiG-SCAPE. |
| Biopython | Programming Library | Python toolkit for parsing, editing, and validating GenBank/FASTA files programmatically. |
| jq | Command-line Tool | Processes and validates JSON files from antiSMASH in Unix/bash environments. |
| SeqKit | Sequence Toolkit | Efficiently manipulates (extract, reformat) FASTA/GenBank sequences via command line. |
| BiG-SCAPE --test | Validation Function | Built-in mode to check input file compatibility before full analysis run. |

Application Notes: Core BiG-SCAPE Parameters

BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) automates the analysis of Biosynthetic Gene Clusters (BGCs) by calculating pairwise distances and generating Gene Cluster Families (GCFs). The following tables summarize the primary command-line parameters.

Table 1: Core Execution & Input/Output Parameters

| Parameter | Default Value | Description | Function |
| --- | --- | --- | --- |
| --inputdir | ./ | Path to directory containing BGCs (GenBank files). | Defines the source of BGC data for analysis. |
| --outputdir | ./ | Path for BiG-SCAPE results directory. | Sets location for all output files (e.g., network, SVG). |
| --pfam_dir | ./ | Path to directory containing Pfam database. | Essential for domain annotation using HMMER. |
| --cores | 8 | Number of CPU cores to use. | Controls parallel processing for speed. |
| --include_singletons | N/A (flag) | No argument needed. | Includes BGCs with no significant similarity in the network. |
| --mibig | N/A (flag) | No argument needed. | Includes MIBiG reference BGCs in the analysis. |

Table 2: Distance Calculation & Clustering Parameters

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| --cutoffs | 0.3 | 0.0 - 1.0 | Raw distance cutoff for edge inclusion in the network: an edge is kept when the distance between two BGCs does not exceed the cutoff, so lower values are stricter. Multiple values (e.g., 0.2 0.5 0.9) can be specified. |
| --mix | N/A (flag) | N/A | Enables analysis of "mixed" clusters (e.g., NRPS-T1PKS). |
| --hybrids | N/A (flag) | N/A | Enables dedicated hybrid prediction mode (experimental). |
| --clans-off | N/A (flag) | N/A | Disables the second layer of clustering that groups GCFs into Gene Cluster Clans. |
| --banned | none | File path | File listing Pfam IDs to exclude from analysis (e.g., common false positives). |
| --class | auto | auto, all, or specific classes | Limits analysis to BGCs of a specific class (e.g., PKSI, NRPS). |

Experimental Protocol: Running a Standard BiG-SCAPE Analysis

This protocol details the steps for a complete BiG-SCAPE run to generate GCFs from a set of BGC GenBank files.

Objective: To calculate domain-based distances between BGCs and cluster them into families using the BiG-SCAPE pipeline.
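To make the idea of a domain-based distance concrete, the sketch below computes the Jaccard index between the Pfam domain sets of three made-up BGCs and converts it to a distance. It is a simplification: the distance BiG-SCAPE reports also takes domain order and domain sequence similarity into account, so these numbers will not match a real run. The BGC names and their domain content are hypothetical.

```python
def jaccard_index(domains_a, domains_b):
    """Jaccard index of two collections of Pfam accessions (0.0 to 1.0)."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical domain content of three BGCs (Pfam accessions for illustration).
bgcs = {
    "BGC_A": ["PF00109", "PF02801", "PF00698", "PF00550"],
    "BGC_B": ["PF00109", "PF02801", "PF00698", "PF08659"],
    "BGC_C": ["PF00501", "PF00668", "PF00550"],
}

names = sorted(bgcs)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        similarity = jaccard_index(bgcs[first], bgcs[second])
        print(f"{first} vs {second}: Jaccard={similarity:.2f} distance={1 - similarity:.2f}")
```

BGC_A and BGC_B share three of five distinct domains (Jaccard 0.60), whereas BGC_B and BGC_C share none, so only the first pair would be linked at a permissive cutoff.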

Materials & Pre-requisites:

  • Input Data: BGC predictions in GenBank format (.gbk), typically from antiSMASH.
  • Software: BiG-SCAPE v1.1.5 or later (installed via Conda: conda create -n bigscape -c bioconda bigscape).
  • Database: Pfam-A.hmm database (v35.0 or later). Downloaded via wget and prepared with hmmpress.
  • System: Linux/macOS with Python 3 and required dependencies.

Procedure:

  • Data Preparation:
    • Place all BGC GenBank files (.gbk) in a single directory (e.g., my_bgcs/).
    • Ensure the Pfam database is prepared in a known directory (e.g., pfam/).
  • Command Execution:

    • Activate the BiG-SCAPE environment: conda activate bigscape.
    • Execute the core analysis with recommended parameters:

  • Output Interpretation:

    • ./results/network_files/: Contains the core output. File mix_0.30_c0.60.network (for the 0.6 cutoff) is used for downstream analysis.
    • ./results/[date]_bigscape/: Contains HTML overview, SVG network visualizations, and GCF tables.
    • Analyze the .tsv and .network files in tools like Cytoscape or using custom scripts.
  • Validation & Downstream Analysis:

    • Verify clustering by inspecting the sequence similarity (Jaccard index) reported in the network file.
    • Cross-reference GCFs with known MIBiG clusters (if --mibig was used) for functional insights.
    • Proceed with CORASON analysis for detailed phylogenetic exploration within GCFs.
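For repeated or scripted runs, the command line can be assembled programmatically. The sketch below only builds and prints the argument list; it uses the script name `bigscape.py` and the parameters listed in Tables 1 and 2, while the function name, its defaults, and the directory names are illustrative assumptions. Confirm the options against `--help` for the installed version before running.

```python
import shlex


def build_bigscape_command(input_dir, output_dir, pfam_dir, cores=8,
                           cutoffs=(0.3,), mix=True, mibig=False,
                           include_singletons=False, script="bigscape.py"):
    """Return the argument list for one BiG-SCAPE run (nothing is executed)."""
    command = [
        script,
        "--inputdir", str(input_dir),
        "--outputdir", str(output_dir),
        "--pfam_dir", str(pfam_dir),
        "--cores", str(cores),
        "--cutoffs", *[str(value) for value in cutoffs],
    ]
    # Optional flags take no argument, so they are appended only when enabled.
    if mix:
        command.append("--mix")
    if mibig:
        command.append("--mibig")
    if include_singletons:
        command.append("--include_singletons")
    return command


command = build_bigscape_command("my_bgcs/", "results/", "pfam/",
                                 cores=8, cutoffs=(0.3, 0.5), mibig=True)
print(shlex.join(command))
```

Keeping the command as a list makes it safe to hand to `subprocess.run(command, check=True)` inside the activated environment, and printing it first leaves a record of the exact parameters used for a given result set.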

Visualizations

[Diagram: Input BGCs (GenBank files) and the Pfam-A.hmm database → Domain annotation (hmmscan) → Generate domain bit score matrix → Pairwise distance calculation (Jaccard index) → Network creation (edges at cutoffs) → Clustering (MCL algorithm) → Output: network and GCFs.]

Title: BiG-SCAPE Core Analysis Workflow

[Diagram: bigscape.py parameters grouped by role]

  • Input parameters (define the data source): --inputdir, --pfam_dir, --class, --mix, --hybrids, --mibig, --banned
  • Distance parameters (control similarity): --cutoffs, --clans-off, --domain_includelist
  • System parameters (manage resources): --cores, --verbose, --help
  • Output parameters (set the results location): --outputdir, --include_singletons

Title: BiG-SCAPE Command-Line Parameter Categories

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials & Tools for BiG-SCAPE Analysis

| Item | Function in Analysis | Notes/Source |
| --- | --- | --- |
| antiSMASH | Generates input GenBank files from genomic data. | Primary BGC prediction tool. https://antismash.secondarymetabolites.org |
| Pfam-A.hmm | Curated database of protein domain families. Used by BiG-SCAPE for domain annotation. | Downloaded from EMBL-EBI (ftp.ebi.ac.uk). |
| HMMER (hmmscan) | Software for scanning protein sequences against Pfam HMMs. | Core dependency of BiG-SCAPE. http://hmmer.org |
| MCL Algorithm | Markov Clustering algorithm used by BiG-SCAPE to cluster the similarity network into GCFs. | Bundled within BiG-SCAPE. |
| Cytoscape | Network visualization and analysis software. Used to explore and refine the .network output. | https://cytoscape.org |
| MIBiG Database | Repository of known BGCs. Used as a reference set to annotate and contextualize novel GCFs. | Enabled via --mibig flag. |
| CORASON | Complementary tool for detailed phylogenetic analysis of specific GCFs identified by BiG-SCAPE. | https://github.com/nselem/corason |

Application Notes

Within the thesis "Expanding the Biosynthetic Landscape: A BiG-SCAPE-Driven Analysis of Bacterial Gene Cluster Families for Novel Drug Discovery," the interpretation of computational outputs is a critical bridge between raw data and biological insight. The following notes detail the core components generated by BiG-SCAPE and CORASON that are essential for GCF analysis.

Network File Analysis (.network)

The network file, typically visualized in tools like Cytoscape, represents the similarity relationships between BGCs. Each node is a BGC, and edges connect BGCs with a pairwise similarity score above a defined cutoff.

Key Metrics for Interpretation:

  • Node Degree: High-degree "hub" BGCs may represent widely conserved, functionally important clusters.
  • Edge Weight: The Jaccard similarity index (0-1). Edges with weights >0.7 often indicate BGCs belonging to the same GCF.
  • Network Modularity: High modularity (Q value >0.4) suggests well-defined, distinct GCFs.
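Node degree can be computed directly from the edge table without opening Cytoscape. The sketch below reads a small, made-up tab-separated edge list and counts the neighbours of each BGC among the edges that pass a distance threshold. The column names (`Clustername 1`, `Clustername 2`, `Raw distance`) are assumptions; check the header line of your own .network file and pass the matching names.

```python
import csv
import io
from collections import Counter

# Stand-in for a .network file: tab-separated, one row per BGC pair.
# Column names and values are illustrative assumptions.
NETWORK_TEXT = (
    "Clustername 1\tClustername 2\tRaw distance\n"
    "BGC_001\tBGC_042\t0.12\n"
    "BGC_001\tBGC_087\t0.25\n"
    "BGC_042\tBGC_087\t0.28\n"
    "BGC_087\tBGC_100\t0.55\n"
)


def node_degrees(handle, max_distance, source="Clustername 1",
                 target="Clustername 2", distance="Raw distance"):
    """Count, per BGC, the edges whose distance is at or below max_distance."""
    degrees = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row[distance]) <= max_distance:
            degrees[row[source]] += 1
            degrees[row[target]] += 1
    return degrees


degrees = node_degrees(io.StringIO(NETWORK_TEXT), max_distance=0.3)
for name, degree in degrees.most_common():
    print(name, degree)
```

With a threshold of 0.3 the three closely related BGCs each have degree 2 and BGC_100 drops out; raising the threshold to 0.6 adds the fourth edge and makes BGC_087 the hub. For a real file, replace the `io.StringIO` object with `open(path, newline="")`.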

GCF Table and Classification

BiG-SCAPE clusters BGCs into Gene Cluster Families (GCFs) based on the network. The mix folder output contains the primary classification table (main_clusters_<cutoff>.txt).

Table 1: Quantitative Summary of a Hypothetical BiG-SCAPE Run (MIBiG v4.0 Reference Database)

| Output Metric | Value | Interpretation |
| --- | --- | --- |
| Input BGCs | 1,250 | Genomes/MAGs analyzed. |
| Predicted GCFs (0.7 cutoff) | 89 | Core families for downstream analysis. |
| Singleton BGCs | 310 | Unique or highly divergent clusters. |
| Largest GCF | 42 members | Potential widespread, conserved biosynthetic machinery. |
| GCFs with MIBiG hit | 56 (63%) | Families with known product potential. |
| "Orphan" GCFs | 33 (37%) | High-priority targets for novel compound discovery. |

Sequence Data and CORASON Analysis

For phylogenomic analysis of specific GCFs, CORASON (CORe Analysis of syntenic orthologues) is used. It drills down into the core biosynthetic genes of a GCF.

Core Outputs:

  • Core Gene Phylogeny: A Newick tree file showing evolutionary relationships based on aligned core enzyme sequences (e.g., PKS KS domains, NRPS A domains).
  • Synteny Visualization: A graphical map (PNG/PDF) showing the order and conservation of core genes across all BGCs in the GCF, confirming true homology beyond sequence similarity.

Protocols

Protocol 1: Running BiG-SCAPE for GCF Network Generation

Objective: To generate a global network of BGC similarity from a set of GenBank files.

Materials & Reagent Solutions:

  • Input Data: GenBank (.gbk) files for BGCs predicted by antiSMASH.
  • BiG-SCAPE Installation: Conda environment with BiG-SCAPE v1.1.5 or higher.
  • Pfam Database: Version 35.0 or current, for domain annotation.
  • Computational Resources: Multi-core server (≥16 cores) with ≥64 GB RAM for large datasets.

Methodology:

  • Prepare Input: Place all GenBank files in a single directory (e.g., my_bgcs/).
  • Run BiG-SCAPE Core:

  • Execute: Runtime ranges from a few hours to several days, depending on the number of BGCs and the cores available. Outputs include the network file, GCF tables, and sequence files in the mix folder.

Protocol 2: CORASON Analysis for GCF Phylogenomics

Objective: To perform a detailed core gene alignment and synteny analysis for a single GCF of interest.

Materials & Reagent Solutions:

  • GCF Sequence Files: The .fasta files for the GCF from BiG-SCAPE's mix folder.
  • Reference Core Gene: A seed sequence (e.g., a known KS domain from a MIBiG reference BGC).
  • CORASON Scripts: Available from the BiG-SCAPE repository.
  • BLAST+ Suite: Locally installed for sequence similarity searches.

Methodology:

  • Identify Target GCF: From the main_clusters_<cutoff>.txt table, select a GCF ID (e.g., GCF_0012).
  • Locate Sequences: Navigate to ./bigscape_output/mix/mibig_gbks_c0.70/GCF_0012/ for the FASTA files.
  • Run CORASON:

  • Interpret Output: The results/ folder will contain the core gene alignment (core_alignment.fasta), phylogenetic tree (core_tree.nwk), and the synteny plot (synteny.pdf).

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions for BiG-SCAPE/CORASON Workflows

| Item | Function in Analysis |
| --- | --- |
| antiSMASH v7.0+ | Predicts BGC boundaries and provides annotated GenBank files as primary input for BiG-SCAPE. |
| Pfam Database | Provides hidden Markov models (HMMs) for protein domain annotation, the basis for BGC similarity calculation. |
| Cytoscape v3.10+ | Network visualization software for exploring and styling the BiG-SCAPE network (.network and .graphml files). |
| Newick Utilities / iTOL | Tools for visualizing, editing, and annotating the phylogenetic tree files produced by CORASON. |
| MIBiG Database | Repository of known BGCs. Essential for annotating GCFs with known chemical products via the --mibig flag in BiG-SCAPE. |
| conda / bioconda | Package and environment management system for ensuring reproducible installation of all bioinformatics tools. |

Visualizations

[Diagram: Input GenBank files (antiSMASH output) → BiG-SCAPE core (domain prediction, pairwise distance) → Network file (.network/.graphml) and GCF classification table (main_clusters.txt) → Cytoscape network visualization and GCF selection → (select GCF ID) → CORASON core gene and synteny analysis → Output: phylogeny and synteny plots for the target GCF.]

Workflow for Gene Cluster Family Analysis

[Diagram: GCF_042 with member BGCs BGC_001 (Streptomyces), BGC_042 (Amycolatopsis), BGC_087 (Salinispora), ... A MIBiG reference BGC joins the GCF when its similarity exceeds the cutoff. Each member BGC contributes to the core gene phylogeny (tree). Synteny map (CORASON output): BGC_001: KS AT ACP ...; BGC_042: -- KS AT ...; BGC_087: KS -- ACP ...]

Structure of a GCF and Its Analysis Outputs

This protocol details the use of Cytoscape for the interactive exploration and visualization of Gene Cluster Family (GCF) networks generated by BiG-SCAPE. Within the broader thesis on BiG-SCAPE for gene cluster family analysis, this application note bridges computational genomics with intuitive biological interpretation. The primary goal is to enable researchers to move from static GCF network files to dynamic, annotated visualizations that facilitate hypothesis generation about biosynthetic diversity, horizontal gene transfer, and potential novel bioactive compounds.

Key Research Reagent Solutions

| Item | Function in Analysis |
| --- | --- |
| BiG-SCAPE v1.x | Core tool for parsing BGCs (from antiSMASH) and generating GCF networks based on sequence similarity. |
| Cytoscape v3.10+ | Open-source platform for network visualization and analysis. Essential for interactive exploration. |
| Cytoscape StringApp | Plugin to import functional annotation data (e.g., KEGG, GO) from STRING database onto network nodes. |
| CytoCluster Plugin | Provides algorithms (e.g., MCODE, HCL) for detecting highly interconnected sub-networks within the GCF graph. |
| EnhancedGraphics Plugin | Enables advanced visual encoding of node attributes (e.g., BGC type, genome taxonomy) using custom charts. |
| .network & .jsn files | Primary output files from BiG-SCAPE containing the network structure and metadata for import into Cytoscape. |
| NCBI Taxonomy Database | Used to annotate nodes with organismal information, enabling phylogeny-aware network layout. |
| antiSMASH BGC GenBank files | Source files for BGC predictions that contain functional domain information for detailed node styling. |

Protocol: From BiG-SCAPE to Cytoscape Network Analysis

Data Preparation & Import

Objective: Import a BiG-SCAPE GCF network into Cytoscape with all associated metadata.

  • Run BiG-SCAPE: Execute BiG-SCAPE on your antiSMASH-derived BGC dataset to generate GCFs.

  • Locate Output Files: Navigate to the ./bigscape_output/network_files/ folder. The key files are:
    • [Mix|Others]_clustering_c0.30.tsv (Network edges)
    • [Mix|Others]_clustering_c0.30.jsn (Node metadata)
  • Import into Cytoscape:
    • Open Cytoscape. Use File → Import → Network from File to select the .tsv edge file.
    • In the import dialog, set Source Column and Target Column to the appropriate identifiers.
    • Immediately after, use File → Import → Table from File to import the .jsn file. Cytoscape will map the metadata to the corresponding nodes.

Network Styling & Annotation

Objective: Visually encode biological properties using Cytoscape's Style panel.

  • Node Color by BGC Product Type:
    • In the Style tab, map the fill color property to the BGC Type column.
    • Use a discrete mapping to assign distinct, color-blind friendly colors (e.g., #EA4335 for NRPS, #4285F4 for PKS, #34A853 for RiPPs).
  • Node Size by BGC Length:
    • Map the size property to the BGC Length column.
    • Use a continuous mapping (e.g., 20-80 pixels) to reflect the size of the gene cluster.
  • Edge Width by Pairwise Similarity Score:
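    • Select an edge attribute (e.g., Raw distance or Similarity score).
    • Map the width property to this column so that more similar pairs get thicker lines: higher values are thicker for a similarity score, whereas a distance column needs the mapping inverted (smaller distance, thicker line).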

Advanced Functional Analysis

Objective: Integrate external functional data and identify subnetworks.

  • Functional Enrichment with StringApp:
    • Install the StringApp via the App Manager.
    • Select nodes of interest (e.g., a specific GCF) and use Apps → STRING → STRING Enrichment to fetch GO or KEGG terms.
    • Overlay significant terms as new node attributes for styling or labeling.
  • Subnetwork Detection with CytoCluster:
    • Install CytoCluster.
    • Run a clustering algorithm (e.g., MCODE) via Apps → CytoCluster.
    • Visually distinguish the resulting clusters by grouping or using a new color mapping.

Table 1: Common BiG-SCAPE Output Metrics for Cytoscape Visualization

| Metric | Typical Range | Description & Visualization Mapping |
| --- | --- | --- |
| GCF Size | 2 - 500+ BGCs | Number of BGCs per family. Map to network cluster density. |
| Pairwise Similarity Score | 0.0 - 1.0 | Jaccard index of shared Pfam domains. Map to edge width/opacity. |
| BGC Length (kb) | 10 - 200 kb | Physical length of the cluster. Map to node size. |
| Domain Count | 5 - 100+ | Number of Pfam domains in a BGC. Map to node border width. |
| Neighbors in Network | 1 - 50+ | Node degree centrality. Map to node color saturation or label size. |

Table 2: Recommended Cytoscape Visual Mappings for Key GCF Attributes

| Biological Attribute | Network Element | Visual Property | Recommended Mapping |
| --- | --- | --- | --- |
| BGC Product Type | Node | Fill Color | Discrete (NRPS=#EA4335, PKS-I=#4285F4, etc.) |
| Taxonomic Class | Node | Border Color | Discrete (Actinobacteria=#34A853, Proteobacteria=#FBBC05) |
| Similarity (Edge Weight) | Edge | Width & Opacity | Continuous (0.3-5px, 20-100% opacity) |
| Centrality (Degree) | Node | Size or Label | Continuous (size: 30-100px) |
| GCF Membership | Network Cluster | Layout Grouping | Force-directed (Prefuse) layout, grouped by cluster. |

Protocol for a Comparative GCF Analysis Workflow

Objective: Compare the network topology and content of two related GCFs.

  • Extract Sub-networks:
    • Use the Select → Nodes → By Column Value to select all nodes where GCF ID equals your target GCF (e.g., GCF_001).
    • Create a new network from the selection (File → New → Network → From Selected Nodes, All Edges).
    • Repeat for the second GCF (GCF_002).
  • Apply Consistent Styling: Copy the visual style from the first sub-network and apply it to the second for direct comparison.
  • Calculate Topological Metrics:
    • For each sub-network, use Tools → Analyze Network to calculate metrics like Average Number of Neighbors, Network Diameter, and Clustering Coefficient. Record these in a table.
  • Analyze BGC Type Composition:
    • In the Table Panel, use the Group By function on the BGC Type column for each sub-network. Count the occurrences of each BGC type.
  • Visual Comparison: Arrange the two sub-network views side-by-side in the Cytoscape dashboard to visually assess differences in connectivity and cluster composition.
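The BGC type tally in the comparison above can also be produced outside Cytoscape from an exported node table. This is a minimal sketch with a made-up table; the column names `GCF ID` and `BGC Type` follow the wording used in this protocol and are assumptions about how the exported table is labelled.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up node table, as it might look after export from Cytoscape.
NODE_TABLE = (
    "BGC\tGCF ID\tBGC Type\n"
    "BGC_001\tGCF_001\tNRPS\n"
    "BGC_002\tGCF_001\tNRPS\n"
    "BGC_003\tGCF_001\tPKS-I\n"
    "BGC_004\tGCF_002\tRiPP\n"
    "BGC_005\tGCF_002\tNRPS\n"
)


def composition_by_gcf(handle, gcf_column="GCF ID", type_column="BGC Type"):
    """Return {GCF ID: Counter of BGC types} from a tab-separated node table."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row[gcf_column]][row[type_column]] += 1
    return counts


composition = composition_by_gcf(io.StringIO(NODE_TABLE))
for gcf_id in ("GCF_001", "GCF_002"):
    summary = ", ".join(f"{kind}={n}" for kind, n in sorted(composition[gcf_id].items()))
    print(f"{gcf_id}: {summary}")
```

Recording the counts this way gives a table that can sit next to the topological metrics from Analyze Network when the two families are compared.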

Diagrams

[Diagram: Genomic data (assembly files) → antiSMASH (BGC prediction) → BiG-SCAPE (GCF network generation) → Cytoscape import and styling → Interactive network analysis → Hypothesis and target selection. Cytoscape modules: StringApp (functional data), CytoCluster (module detection), EnhancedGraphics (custom charts).]

Title: BiG-SCAPE to Cytoscape GCF Analysis Workflow

[Diagram: A core GCF node linked to NRPS, PKS-I, PKS-other, Hybrid, and RiPP BGC nodes. The NRPS, Hybrid, and PKS-I nodes are additionally linked to one another (NRPS → Hybrid → PKS-I), forming a tightly connected subnetwork.]

Title: Styled GCF Network with BGC Types & Subnetworks

Solving Common BiG-SCAPE Issues: Tips for Performance, Accuracy, and Custom Analyses

Application Notes

Within the thesis research focused on BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) for gene cluster family (GCF) analysis, managing computational resources is critical. The exponential growth of genomic data from public repositories like MIBiG and GenBank presents significant challenges in memory usage, storage, and processing time during the analysis of Biosynthetic Gene Clusters (BGCs). These challenges are amplified when constructing large similarity networks of BGCs to infer evolutionary relationships and discover novel chemical diversity.

Core Challenges:

  • Dataset Size: A single run may involve processing tens of thousands of BGCs, each represented by multiple protein sequences (domains).
  • Memory-Intensive Operations: The all-vs-all pairwise comparison of BGCs using the Domain Sequence Similarity (DSS) metric and the subsequent generation of similarity networks are computationally demanding.
  • I/O Bottlenecks: Reading and writing multiple intermediate files (e.g., .gbk, .fasta, .json, .network files) for thousands of BGCs can strain storage systems.

Key Strategies for Resource Management:

  • Pre-filtering and Subsetting: Analyzing specific taxonomic groups or BGC types (e.g., all Streptomyces NRPS clusters) reduces initial load.
  • Leveraging HPC/Cloud Environments: Utilizing cluster computing with job schedulers (SLURM, SGE) and scalable cloud resources (AWS, GCP) is often necessary for production-scale analyses.
  • Parameter Optimization: Adjusting BiG-SCAPE flags such as --cutoffs, --mix, and --clans-off controls the granularity and computational cost of analysis.
  • Efficient Data Handling: Using high-performance local storage (NVMe SSDs) and managing pipeline intermediates by writing to /tmp (if node-local storage exists) can drastically improve I/O performance.

Protocols

Protocol 1: Scalable BiG-SCAPE Run on a High-Performance Computing Cluster

This protocol outlines a memory- and storage-conscious execution of BiG-SCAPE for large-scale GCF analysis.

Materials & Software:

  • BiG-SCAPE (v1.x or higher)
  • Python 3.7+ with dependencies (hmmer, mafft, prodigal, etc.)
  • HPC cluster with SLURM scheduler
  • Input: Directory containing GenBank files (.gbk) of BGCs

Procedure:

  • Input Preparation:
    • Place all BGC GenBank files in a dedicated directory (e.g., my_bgcs/).
    • Validate file formats using a script to check for common parsing errors.
  • SLURM Job Script Configuration:

    • Create a job script (run_bigscape.slurm) with appropriate resource requests.

  • Job Submission and Monitoring:

    • Submit job: sbatch run_bigscape.slurm
    • Monitor via squeue -u $USER and output log files.
  • Post-Processing:

    • Network files are generated in /path/to/results/network_files/.
    • Use Cytoscape for downstream visualization and analysis of the .network files.
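A job script for the protocol above can be generated from a few parameters so that resource requests and the BiG-SCAPE call stay in sync. The sketch below is one possible layout, not a site-specific template: the resource defaults, log file name, and environment name are assumptions, and partition or account directives required by a given cluster are omitted.

```python
def make_slurm_script(input_dir, output_dir, pfam_dir, cores=16,
                      memory_gb=64, hours=24, env_name="bigscape"):
    """Return the text of a SLURM batch script for one BiG-SCAPE run."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=bigscape",
        f"#SBATCH --cpus-per-task={cores}",
        f"#SBATCH --mem={memory_gb}G",
        f"#SBATCH --time={hours:02d}:00:00",
        "#SBATCH --output=bigscape_%j.log",
        "",
        "# Make 'conda activate' available in a non-interactive shell.",
        'source "$(conda info --base)/etc/profile.d/conda.sh"',
        f"conda activate {env_name}",
        "",
        "# Give BiG-SCAPE exactly the cores that SLURM allocated.",
        "bigscape.py \\",
        f"  --inputdir {input_dir} \\",
        f"  --outputdir {output_dir} \\",
        f"  --pfam_dir {pfam_dir} \\",
        '  --cores "$SLURM_CPUS_PER_TASK" \\',
        "  --mix",
    ]
    return "\n".join(lines) + "\n"


script = make_slurm_script("my_bgcs/", "/path/to/results", "pfam/")
print(script)
```

Save the printed text as run_bigscape.slurm before submitting it. Passing `$SLURM_CPUS_PER_TASK` to `--cores` avoids the common mismatch in which the tool starts more worker processes than the scheduler granted.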

Protocol 2: Memory-Efficient Processing of Intermediate Data

This protocol details handling large intermediate files generated during BiG-SCAPE's domain alignment phase.

Procedure:

  • Monitor Directory: During execution, monitor the domains and jsons folders in the output directory.
  • Compression of DSS Matrices: After a successful run, compress the large JSON files containing pairwise similarity matrices.

  • Selective Archiving: Archive only essential results (network files, PDFs, log files) for long-term storage, deleting raw domain alignment data if necessary.
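The compression step can be scripted with the standard library. This sketch gzips every `.json` file found below a directory and keeps the originals unless told otherwise; the function name is hypothetical, and which output subfolders actually hold large JSON files should be checked in your own results directory.

```python
import gzip
import shutil
from pathlib import Path


def compress_json_files(directory, remove_original=False):
    """Gzip every *.json below `directory`; return the list of .gz paths written."""
    written = []
    for source in sorted(Path(directory).rglob("*.json")):
        target = source.with_name(source.name + ".gz")
        # Stream the file in chunks so large matrices never sit fully in memory.
        with open(source, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if remove_original:
            source.unlink()
        written.append(target)
    return written
```

Calling `compress_json_files("/path/to/results")` leaves the originals in place; pass `remove_original=True` only after confirming that the run finished and the compressed copies are readable.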

Data Presentation

Table 1: Computational Resource Requirements for BiG-SCAPE Analysis of Varying Dataset Sizes

| Number of BGCs | Approx. Input Size | Peak Memory Usage (Est.) | Suggested CPU Cores | Estimated Runtime* | Output Directory Size |
| --- | --- | --- | --- | --- | --- |
| 500 | 100 - 200 MB | 8 - 16 GB | 4 | 2 - 4 hours | 1 - 2 GB |
| 5,000 | 1 - 2 GB | 32 - 64 GB | 8 - 16 | 12 - 24 hours | 10 - 20 GB |
| 20,000 | 4 - 8 GB | 128+ GB | 16 - 32 | 3 - 7 days | 40 - 80 GB |

*Runtime varies based on BGC complexity (PKS/NRPS vs. RiPPs) and HPC node performance.
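The steep growth in these estimates follows from the all-vs-all design: n BGCs give n(n-1)/2 distinct pairs. The short calculation below shows the pair counts for the dataset sizes in Table 1. It is an upper bound for illustration, since a run that compares BGCs only within their own class evaluates fewer pairs.

```python
def pairwise_comparisons(n_bgcs):
    """Number of distinct unordered pairs among n_bgcs items: n * (n - 1) / 2."""
    return n_bgcs * (n_bgcs - 1) // 2


for n in (500, 5_000, 20_000):
    print(f"{n:>6,} BGCs -> {pairwise_comparisons(n):>12,} pairs")
```

A tenfold increase from 500 to 5,000 BGCs raises the number of pairs roughly a hundredfold (124,750 to 12,497,500), which is why subsetting the input is usually the most effective way to shorten a run.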

Table 2: Impact of BiG-SCAPE Parameters on Computational Load

| Parameter | Function | Effect on Computation Time | Effect on Memory Use | Recommendation for Large Datasets |
| --- | --- | --- | --- | --- |
| --cutoffs | Defines the distance cutoffs for networking | More cutoffs = Increased | Minor Increase | Use the default (0.3) or a short list of values |
| --mix | Allows mixing of BGC types in GCFs | Increases clustering steps | Moderate Increase | Enable for comprehensive analysis |
| --clans-off | Skips the second-layer (clan) clustering | Decreases significantly | Decreases | Use for initial exploratory network runs |
| --mibig | Includes MIBiG reference BGCs | Minor Increase | Minor Increase | Always enable for benchmarking |
| --mode | Alignment mode (global/glocal/auto) | Global is more intensive | Similar | Use global for accuracy; auto for speed |

Visualizations

[Diagram: Input GenBank files (10,000s of BGCs) → 1. Parse and pre-process → 2. Domain prediction (HMMER/Pfam) → 3. All-vs-all alignment (MAFFT) → 4. DSS score calculation → 5. Network generation (edge: DSS > cutoff) → 6. GCF clustering (MCL algorithm) → Output: networks and Gene Cluster Families. Bottlenecks: large I/O at domain prediction; high memory and CPU at all-vs-all alignment.]

Title: BiG-SCAPE Workflow with Computational Bottlenecks

[Diagram: Memory Limit Error branches into three strategies, each leading to a Successful BiG-SCAPE Run. Reduce Load: subset input BGCs (e.g., by taxonomy); run with --clusters-off for network only. Increase Resources: request a high-memory node (>128GB RAM); use a memory-optimized cloud instance. Optimize Process: ensure the --cpus flag matches allocated cores; use fast local storage (/tmp) for I/O.]

Title: Decision Tree for Resolving Memory Limits in BiG-SCAPE

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Resources for Large-Scale BiG-SCAPE Analysis

| Item | Function/Purpose | Notes for Resource Management |
|---|---|---|
| High-Performance Computing (HPC) Cluster | Provides scalable CPU, memory, and parallel job execution. | Essential for datasets >5,000 BGCs. Use SLURM/SGE to manage resources. |
| Cloud Computing Platform (AWS EC2, GCP) | Offers on-demand, configurable virtual machines (e.g., memory-optimized instances). | Useful when institutional HPC is unavailable. Cost monitoring is critical. |
| Pfam Database (v35+) | HMM database for protein domain detection via HMMER. | Required by BiG-SCAPE. Storing on fast local SSD improves performance. |
| Conda/Bioconda Environment | Manages software dependencies (Python, hmmer, mafft, prodigal). | Ensures reproducibility and avoids conflicts. |
| NVMe Solid-State Drive (SSD) | High-speed local storage for I/O-intensive operations. | Dramatically reduces time spent reading/writing domain files. |
| Cytoscape (v3.9+) | Network visualization and analysis software. | For exploring the final .network files; requires GUI/desktop access. |
| Python Pandas/NumPy | For custom post-processing of BiG-SCAPE output tables. | Enables filtering, summarizing, and analyzing GCF results. |
| Singularity/Apptainer Container | Containerization of the entire BiG-SCAPE environment. | Guarantees portability and consistency across HPC & cloud systems. |

Addressing Dependency and Version Conflicts (Python, HMMER, FastTree)

Application Notes

Within the context of a thesis on BiG-SCAPE for gene cluster family analysis, managing dependency and version conflicts is a critical prerequisite for reproducible and scalable research. BiG-SCAPE, a Python-based tool for delineating and analyzing Biosynthetic Gene Cluster (BGC) families, relies on external bioinformatics software, primarily HMMER for profile Hidden Markov Model searches and FastTree for phylogenetic inference. Inconsistent versions of these dependencies across computing environments (e.g., local workstations, high-performance computing clusters, cloud instances) can lead to silent errors, divergent numerical outputs, and ultimately, non-reproducible family networks.

Key Conflict Points:

  • Python Environment: BiG-SCAPE is compatible with Python 3.6-3.9. Using Python 3.10+ without dependency validation can cause installation failures.
  • HMMER Suite: Changes in HMMER's output format (e.g., between versions 3.1 and 3.3) can break BiG-SCAPE's parsing logic for hmmscan results.
  • FastTree: The transition from FastTree to FastTree 2 (for higher accuracy) is standard, but version discrepancies in the MPI parallel implementation can cause runtime crashes.
  • System Libraries: Underlying C libraries (e.g., GLIBC) on Linux systems can create "version `GLIBC_2.XX' not found" errors for pre-compiled binaries.

The following table summarizes core compatibility matrices and common conflict symptoms:

Table 1: BiG-SCAPE Dependency Compatibility and Conflict Manifestations

| Software Component | Recommended Version | Known Incompatible Versions | Symptom of Conflict |
|---|---|---|---|
| BiG-SCAPE Core | 1.1.5 (stable) | N/A | Baseline for analysis. |
| Python | 3.7, 3.8 | <3.6, >3.9 (untested) | SyntaxError, package installation failures via pip. |
| HMMER | 3.3.2 | 3.0, 3.1 (format differences) | BiG-SCAPE fails to parse domain information; empty network files. |
| FastTree | 2.1.11 (Double precision) | <2.0, OpenMPI variants mismatch | Segmentation fault, uninterpretable tree branch lengths. |
| MPI for FastTree | OpenMPI 4.0.5 | Intel MPI (without compatibility layer) | mpirun fails to launch FastTree processes. |
| System GLIBC | >=2.17 | Variable (older HPC systems) | "Floating point exception" on binary execution. |

Experimental Protocols

Protocol 1: Isolated Environment Setup Using Conda

Objective: Create a conflict-free, reproducible environment for BiG-SCAPE and its dependencies.

Materials:

  • Unix-based system (Linux/macOS)
  • Miniconda or Anaconda distribution installed
  • BiG-SCAPE source code (from GitHub)

Methodology:

  • Create a new Conda environment:

  • Install core dependencies via Conda channels:

  • Verify installations:

  • Install BiG-SCAPE in development mode:
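
Once the environment is active, a quick sanity check can catch missing executables before a long run. The sketch below is illustrative only: the executable names in REQUIRED_TOOLS are assumptions (binaries may be named differently on your system, for example fasttree rather than FastTree), and the Python range is the 3.6-3.9 window quoted above.

```python
import shutil
import sys

# Assumed executable names; adjust to match your installation.
REQUIRED_TOOLS = ("hmmscan", "hmmpress", "FastTree", "mafft")


def python_in_supported_range(version=None, low=(3, 6), high=(3, 9)):
    """True if the (major, minor) version lies within [low, high]."""
    major_minor = tuple((version or sys.version_info)[:2])
    return low <= major_minor <= high


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python in 3.6-3.9 range:", python_in_supported_range())
    print("Missing tools:", find_missing_tools() or "none")
```

This only confirms that the tools are discoverable; it does not verify their versions, which still need to be checked against the compatibility table.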

Protocol 2: Validation of HMMER Output Parsing

Objective: Confirm that the installed HMMER version produces output compatible with BiG-SCAPE's parser.

Materials:

  • Activated Conda environment from Protocol 1.
  • A small, test set of Pfam HMM profiles (e.g., Pfam-A.hmm subset).
  • A multi-FASTA file of predicted protein sequences from a BGC.

Methodology:

  • Prepare a test HMM database:

  • Run hmmscan in tabular (--tblout) and domain (--domtblout) modes:

  • Execute BiG-SCAPE's parsing module directly:

    • Expected Output: A positive integer count of parsed domains.
    • Conflict Indicator: KeyError, IndexError, or a parsed count of zero indicates a format mismatch.
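
As an independent cross-check of the domain count, the --domtblout file can also be parsed without BiG-SCAPE. The sketch below does not call BiG-SCAPE's own parser; it relies on the HMMER 3 domain table layout (22 whitespace-separated fields followed by a free-text description, with comment lines starting with '#'). The function name and the default E-value threshold are assumptions chosen for illustration.

```python
def count_domain_hits(domtblout_text, max_evalue=1e-5):
    """Count domain hits in HMMER --domtblout text.

    Raises ValueError if a data row has fewer than the 22 expected fields,
    which is the kind of format mismatch this protocol is looking for.
    """
    hits = 0
    for line in domtblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split(maxsplit=22)
        if len(fields) < 22:
            raise ValueError(f"Unexpected domtblout row: {line!r}")
        independent_evalue = float(fields[12])  # i-Evalue column
        if independent_evalue <= max_evalue:
            hits += 1
    return hits
```

A count of zero on a file that visibly contains data rows, or a ValueError, points to the same kind of mismatch as the conflict indicators listed above.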

Protocol 3: FastTreeMP Parallel Execution Test

Objective: Ensure the MPI-parallel version of FastTree executes correctly within the environment.

Materials:

  • Activated Conda environment with OpenMPI.
  • A multiple sequence alignment (MSA) file in FASTA or STOCKHOLM format (e.g., from a single gene cluster family).

Methodology:

  • Test MPI infrastructure:

  • Run FastTreeMP on a test alignment:

  • Validate output:

    • Expected Output: A single line beginning with ( (Newick format).
    • Conflict Indicator: No output file, error message concerning MPI ranks, or a truncated file.
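
The output check can be automated with a cheap structural test. This is a minimal sketch, not a full Newick parser: it confirms only that the text starts with '(', ends with ';', and has balanced parentheses, and it assumes labels do not contain quoted parentheses. The function name is illustrative.

```python
def looks_like_newick(text):
    """Cheap structural check: starts with '(', ends with ';', parentheses balance."""
    tree = text.strip()
    if not (tree.startswith("(") and tree.endswith(";")):
        return False
    depth = 0
    for char in tree:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a closing parenthesis with no matching opener
                return False
    return depth == 0
```

A truncated tree file typically fails this check because the trailing semicolon or closing parentheses are missing.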

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions for Dependency Management

| Item | Function & Rationale |
|---|---|
| Conda/Bioconda | Package manager that resolves binary dependencies for bioinformatics software, ensuring compatible versions of HMMER, FastTree, and Python libraries. |
| Docker/Singularity | Containerization platforms to package the entire BiG-SCAPE environment (OS, libraries, tools), guaranteeing absolute reproducibility across any system. |
| Python virtualenv | Lightweight Python environment isolation, useful for managing Python-level dependencies if system binaries (HMMER, FastTree) are globally stable. |
| Environment File (environment.yml) | A YAML file specifying exact versions of all Conda dependencies, enabling one-command environment reconstruction (conda env create -f environment.yml). |
| CI/CD Pipeline (e.g., GitHub Actions) | Automated testing service to run BiG-SCAPE on a small dataset with every code change, immediately detecting dependency conflicts introduced by updates. |

Visualizations

[Diagram: Inside the Conda Environment, BiG-SCAPE Core (Python 3.8) calls HMMER 3.3.2 (hmmscan) and FastTree 2.1.11 (MPI). HMMER supplies domain profiles and FastTree supplies phylogeny to the Reproducible Gene Cluster Families output. FastTree requires System Libs (GLIBC >=2.17).]

Title: BiG-SCAPE Dependency Stack Managed via Conda

[Diagram: Start → Conda Env Created? (No: Version Conflict) → HMMER Output Parsed? (No: Format Conflict) → FastTreeMP Executed? (No: MPI Conflict) → Run Full BiG-SCAPE → Analysis Success]

Title: Dependency Conflict Resolution Workflow

Application Notes & Protocols

Within the broader thesis on BiG-SCAPE for gene cluster family (GCF) analysis, parameter optimization is critical for generating biologically relevant and reproducible network outputs. This protocol details the systematic tuning of three core parameters: Domain Similarity Cutoff (--cutoffs), MIBiG Reference Dataset inclusion (--mibig), and Neighborhood Clustering Sensitivity (--clust-offset).

1. Protocol: Determining the Optimal Domain Similarity Cutoff

Objective: To establish the most appropriate --cutoffs parameter for balancing the inclusivity of related BGCs with the specificity required for meaningful GCF formation.

Materials & Reagents:

  • High-quality genomic dataset in GenBank format.
  • Preprocessed antiSMASH 6+ results for all genomes.
  • BiG-SCAPE v1.1.5 or later, installed with dependencies (e.g., HMMER, DIAMOND).
  • Reference BGCs from the MIBiG database (v3.1).
  • High-performance computing (HPC) cluster or workstation (>32 GB RAM recommended).

Procedure:
A. Baseline Run: Execute BiG-SCAPE with a broad cutoff range.

B. Network Analysis: For each cutoff value, analyze the resulting network (e.g., in network.html). Record key metrics (Table 1).
C. GCF Validation: Cross-reference GCFs from each cutoff with known MIBiG BGCs. Assess fragmentation of known clusters versus over-merging of disparate clusters.
D. Optimal Selection: Choose the cutoff that maximizes the retention of known BGC families as cohesive GCFs while maintaining a manageable number of singletons.

Table 1: Comparative Analysis of Domain Similarity Cutoff Values

| Cutoff Value | Total GCFs | Singletons | Known Clusters in Correct GCF | Avg. GCF Size | Recommended Use Case |
|---|---|---|---|---|---|
| 0.3 | Low | Very Low | High (but may over-merge) | High | Exploratory, broad relationships |
| 0.5 | Moderate | Low | High | Moderate | General-purpose analysis |
| 0.7 | High | Moderate | Moderate | Low | High-resolution, focused studies |
| 0.9 | Very High | High | Low (may fragment) | Very Low | Detecting very close homologs |
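
The metrics in this table can be tallied from an edge list for any cutoff. The sketch below is a simplified stand-in, not BiG-SCAPE's clustering: it groups BGCs by connected components using union-find, and it assumes each edge carries a pairwise distance that is kept when at or below the cutoff (flip the comparison if your scores are similarities). All names are illustrative.

```python
def component_summary(nodes, edges, cutoff):
    """Group nodes into connected components using edges with distance <= cutoff.

    Returns (number of multi-member groups, number of singletons).
    """
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b, distance in edges:
        if distance <= cutoff:
            parent[find(a)] = find(b)  # merge the two components

    sizes = {}
    for node in nodes:
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    groups = sum(1 for size in sizes.values() if size > 1)
    singletons = sum(1 for size in sizes.values() if size == 1)
    return groups, singletons
```

Calling this once per cutoff on the same edge list gives a quick view of how group and singleton counts shift before any detailed inspection of the networks.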

2. Protocol: Integrating and Weighting MIBiG References (--mibig)

Objective: To leverage the curated MIBiG database for annotating and grounding GCF networks, enhancing biological interpretability.

Procedure:
A. Full Integration (--mibig): Run BiG-SCAPE with the --mibig flag. This adds MIBiG reference BGCs to the analysis alongside the input BGCs, allowing reference clusters to seed or join GCFs. Note that --mix is a separate option: it analyzes all BGC classes together in one network rather than class by class, and does not by itself add reference clusters.

B. Annotation-Only (no --mibig): Run BiG-SCAPE without --mibig and cross-reference GCF members with MIBiG post-analysis. This treats MIBiG as an external annotation layer.
C. Comparative Assessment: Compare the two outputs. The --mibig run shows MIBiG BGCs as integral nodes within GCFs, often pulling similar unknown clusters into better-defined families. The annotation-only approach keeps the input and reference sets distinct.

3. Protocol: Fine-tuning Clustering Sensitivity (--clust-offset)

Objective: To adjust the granularity of the final GCF clustering after the initial sequence similarity network is built.

Background: The --clust-offset parameter influences the Markov Clustering (MCL) inflation algorithm. Higher values increase granularity (more, smaller GCFs); lower values decrease it (fewer, larger GCFs).

Procedure:
A. Offset Series Experiment: Execute BiG-SCAPE with a fixed --cutoffs value while varying --clust-offset.

B. Quantitative Benchmark: For each offset, measure clustering metrics relative to a "gold standard" (e.g., a curated set of MIBiG clusters known to be in the same/different families). Calculate metrics like Adjusted Rand Index (ARI) or pairwise precision/recall.
C. Visual Inspection: Examine the network files. Identify which offset produces GCFs that best align with biological intuition, e.g., keeping all known variants of a specific NRPS cluster together while separating functionally distinct clusters.
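
Pairwise precision and recall for the benchmark step can be computed with the standard library alone. This is a minimal sketch under stated assumptions: both clusterings are given as dictionaries mapping a BGC identifier to a family label, only BGCs present in both are scored, and an empty pair set is scored as 1.0 by convention. Function names are illustrative.

```python
from itertools import combinations


def co_clustered_pairs(assignment):
    """Set of unordered item pairs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_precision_recall(predicted, gold):
    """Pairwise precision and recall of predicted vs. gold-standard clusters."""
    shared = set(predicted) & set(gold)
    pred_pairs = co_clustered_pairs({k: predicted[k] for k in shared})
    gold_pairs = co_clustered_pairs({k: gold[k] for k in shared})
    true_positive = len(pred_pairs & gold_pairs)
    # Convention: with no pairs to judge, report a perfect score.
    precision = true_positive / len(pred_pairs) if pred_pairs else 1.0
    recall = true_positive / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall
```

Low precision indicates over-merging (unrelated BGCs placed together), while low recall indicates over-splitting (known family members separated), which maps directly onto the two failure modes described for the offset parameter.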

Table 2: Impact of Clustering Offset Parameter

| --clust-offset | Clustering Behavior | Network Appearance | Effect on GCFs |
|---|---|---|---|
| 1.0 - 2.0 | Very Permissive | Few, dense hubs | Few, large GCFs; potential over-merging |
| 3.0 - 4.0 | Moderate (Default) | Balanced | Recommended starting point |
| 5.0 - 6.0 | Aggressive | Many, small hubs | Many, small GCFs; potential over-splitting |

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for BiG-SCAPE Parameter Optimization

| Item | Function in Protocol |
|---|---|
| MIBiG Database (JSON) | Curated reference standard for BGC annotation and validation of GCF accuracy. |
| antiSMASH Results | Standardized input files containing predicted BGCs and their domain architecture. |
| BiG-SCAPE Software | Core algorithm for calculating pairwise distances and generating GCF networks. |
| HMMER Suite | Underlying tool for sensitive protein domain identification and alignment. |
| DIAMOND | Accelerated BLAST-based tool used for fast protein sequence comparisons. |
| Python Environment | Required runtime for executing BiG-SCAPE and its analysis scripts. |
| Cytoscape or similar | For advanced visualization and analysis of the .network file output. |

Visualization: Parameter Optimization Workflow

[Diagram: Input: antiSMASH BGCs → Parameter Tuning Module (the MIBiG Database joins here via the --mibig flag) → Core BiG-SCAPE Process: Calculate Pairwise Distances (apply --cutoffs) → Build Similarity Network → MCL Clustering (--clust-offset) → Output: Gene Cluster Families & Network Files]

Diagram: BiG-SCAPE Tuning Workflow

Visualization: Parameter Interdependence Logic

[Diagram: --cutoffs (0.3 - 0.9) is the primary driver of Network Density. --mibig (Yes/No) adds reference nodes to the network and provides annotation for Biological Relevance. Network Density is the input to --clust-offset (1.0 - 6.0), which directly controls GCF Granularity. GCF Granularity must reflect natural groups to support Biological Relevance.]

Diagram: How Parameters Affect Output

In the context of a broader thesis utilizing BiG-SCAPE for gene cluster family (GCF) analysis, targeted investigation of specific biosynthetic gene cluster (BGC) types is a critical strategy for efficient natural product discovery. BiG-SCAPE’s initial network analysis provides a global view of BGC diversity. Targeted analysis then involves extracting and deeply characterizing clusters of a particular class—such as Nonribosomal Peptide Synthetases (NRPS), Polyketide Synthases (PKS), or Ribosomally synthesized and Post-translationally modified Peptides (RiPPs)—to prioritize leads with higher potential for novel chemistry and bioactivity. This approach streamlines the transition from genomic data to testable hypotheses in drug development pipelines.

Key Quantitative Data from Recent Studies

Table 1: Comparative Metrics for Targeted BGC Analysis Pipelines (2023-2024)

| BGC Type | Primary Tool(s) | Avg. Processing Time (per 100 BGCs) | Key Detection Metric (Sensitivity) | Common Prioritization Criteria |
|---|---|---|---|---|
| NRPS | antiSMASH, PRISM 4, NRPSpredictor3 | 45-60 min | Adenylation (A) domain specificity prediction (>92%) | Substrate novelty, cluster hybridization, core structure diversity |
| PKS | antiSMASH, PKS2, TransATor | 50-70 min | Ketosynthase (KS) domain phylogeny & specificity | KS domain sequence novelty, trans-AT vs cis-AT designation, modularity |
| RiPPs | antiSMASH, RODEO, RiPPMiner | 30-50 min | Precursor peptide & core peptide recognition (>90%) | Post-translational modification (PTM) enzyme repertoire, leader peptide cleavage site |
| Hybrid | antiSMASH, decRiPPter | 70-120 min | Detection of interleaved NRPS/PKS/RiPP modules | Extent of hybridization, presence of rare catalytic domains |

Experimental Protocols

Protocol 1: Targeted NRPS/PKS Analysis Post-BiG-SCAPE

Objective: To perform detailed substrate specificity prediction and modular architecture analysis for NRPS/PKS-type GCFs identified by BiG-SCAPE.

Materials:

  • High-performance computing cluster or workstation.
  • Genomic files (GBK/FASTA) for GCF members.
  • BiG-SCAPE v1.2.0+ output network files.
  • Conda environment with antiSMASH v7.1.0, PRISM 4, and NRPSpredictor3.

Methodology:

  • GCF Extraction: From the BiG-SCAPE network (network.tsv), identify the node IDs for your target GCF. Use the --banned and --include options in a secondary BiG-SCAPE run to isolate and extract all GenBank files for that specific GCF.
  • Deep Annotation: Run antiSMASH on the extracted GCF files with the --clusterhmmer, --asf, and --tta flags enabled for full module prediction.
  • Substrate Prediction: For NRPS clusters, parse the nrpspredictor3 results from antiSMASH or run the standalone NRPSpredictor3 tool on the adenylation domain sequences. For PKS clusters, extract KS domain sequences and analyze using the TransATor pipeline or the Integrated Microbial Genomes/Atlas of Biosynthetic gene Clusters (IMG-ABC) KS phylogeny tool.
  • Structure Prediction: Input the antiSMASH results (in JSON format) into the PRISM 4 web server or standalone tool to generate predicted chemical scaffolds.
  • Prioritization: Rank clusters based on: a) Predicted incorporation of rare or novel substrates (e.g., D-amino acids, β-amino acids), b) Unusual colinearity or module skipping events, c) High novelty score from PRISM.

Protocol 2: Targeted RiPP Analysis and Core Peptide Prediction

Objective: To identify and characterize RiPP precursor peptides and their modification enzymes within a RiPP-focused GCF.

Materials:

  • Isolated GCF genomic files from BiG-SCAPE.
  • antiSMASH v7.1.0+ with RiPP modules.
  • RODEO (web server or local installation).
  • Genome neighborhood network visualization (e.g., via Clinker).

Methodology:

  • Initial Detection: Process GCF genomes through antiSMASH with the --rre and --rri flags to maximize RiPP recognition.
  • Precursor Identification: For BGCs identified as RiPP-like (e.g., lanthipeptide, thiopeptide), extract the genomic region and submit the relevant ORF (often a short peptide flanked by modifying enzymes) to RODEO for core peptide prediction and leader peptide analysis.
  • Neighborhood Analysis: Use Clinker to generate a high-identity alignment of the GCF members, visualizing the conservation of the precursor peptide sequence and the modifying enzyme genes (e.g., LanM for lanthipeptides).
  • PTM Profiling: Manually annotate the predicted functions of modification enzymes (e.g., cyclodehydratase for thiazoles, cytochrome P450 for oxidations) using HMM profiles (e.g., from Pfam) against the protein sequences.
  • Prioritization: Rank BGCs based on: a) Conservation of core peptide motif with hypervariable residues, b) Presence of multiple or rare PTM enzymes, c) Phylogenetic novelty of the modifying enzymes relative to known clusters.

Diagrams

[Diagram: BiG-SCAPE Genome Network → (network file) → Filter by BGC Type (e.g., 'NRPS') → (GCF selection) → Extract Target GCF → (GBK files) → Deep Annotation (antiSMASH, PRISM) → (domain sequences) → Domain Analysis (NRPSpredictor3, TransATor) → (specificity data) → Chemical Structure Prediction & Ranking → Prioritized BGCs for Heterologous Expression]

Targeted Analysis Workflow for NRPS/PKS BGCs

[Diagram: RiPP BGC (Leader Peptide Gene; Modification Enzyme Genes; Transport/Immunity Genes) → precursor sequence and PTM enzyme context → 1. RODEO Analysis, 2. Motif Detection → Predicted Core Peptide; Modified Residues (e.g., Lan, Oxazole); Mature Product Structure]

RiPP Precursor Peptide to Core Structure Analysis

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Reagents and Tools for Targeted BGC Characterization

| Item Name | Supplier/Catalog (Example) | Function in Targeted Analysis |
|---|---|---|
| Phusion HF DNA Polymerase | Thermo Fisher Scientific (F-530L) | High-fidelity PCR for amplifying entire BGCs or specific domains (e.g., A-domains, KS-domains) from genomic DNA for cloning or sequencing. |
| Gateway BP/LR Clonase II | Thermo Fisher Scientific (11789020) | Enzyme mix for efficient, site-specific recombination cloning of large, complex BGCs into heterologous expression vectors (e.g., pCAP01). |
| E. coli GB05-dir Competent Cells | Lucigen (60402-1) | Specialized E. coli strains for direct cloning of large, methylated DNA fragments (e.g., directly from Streptomyces genomic DNA). |
| pCAP01 Cosmid Vector | Addgene (Vector #140300) | A Streptomyces-E. coli shuttle cosmid for capturing and expressing large BGCs (up to ~50 kb) in heterologous hosts. |
| Streptomyces albus J1074 | DSMZ (Culture Slant # 40447) | A common, genetically tractable, and low-background metabolite heterologous host for expressing cloned BGCs from Actinobacteria. |
| Ni-NTA Superflow Resin | Qiagen (30410) | Immobilized metal affinity chromatography resin for purifying His-tagged recombinant proteins (e.g., individually expressed NRPS adenylation domains) for in vitro biochemical assays. |
| S-Adenosyl-L-methionine (SAM) | Sigma-Aldrich (A7007) | Essential methyl donor cofactor for in vitro assays with methyltransferase enzymes commonly found in RiPP and PKS pathways. |

Within the broader thesis on BiG-SCAPE for gene cluster family analysis research, this document addresses a critical advanced capability: extending the core algorithm to recognize biosynthetic gene cluster (BGC) classes beyond its default database. The default Pfam-based HMM profiles in BiG-SCAPE and antiSMASH are powerful but may miss novel or highly divergent biosynthetic logic. This protocol details the methodology for integrating custom Hidden Markov Model (HMM) profiles and leveraging them to define novel BGC classes, thereby enhancing the resolution and discovery potential of genomic mining studies relevant to natural product drug discovery.

Application Notes: Rationale and Workflow Integration

Integrating custom HMMs allows researchers to:

  • Target specific enzyme families (e.g., novel condensases, decorator enzymes) not fully captured by Pfam.
  • Define "fingerprint" domains for a putative novel BGC class of interest.
  • Refine family boundaries within BiG-SCAPE network outputs, creating more phylogenetically coherent Gene Cluster Families (GCFs).

The process integrates into the BiG-SCAPE workflow as follows:

[Diagram: Workflow for Custom HMM Integration. Starting Point: Novel Enzyme/Domain of Interest → 1. Build Multiple Sequence Alignment → 2. Build Custom HMM (hmmbuild) → 3. Merge with BiG-SCAPE Pfam DB → 4. Run BiG-SCAPE with --pfam_dir Custom DB → 5. Analyze Network: New GCFs & Associations → 6. Define Novel BGC Class Rules & Annotations]

Experimental Protocols

Protocol 3.1: Creating a Custom HMM Profile

Objective: Generate a high-quality HMM from a set of homologous protein sequences.

  • Sequence Curation: Collect protein sequences of the target domain. Sources include:
    • In-house sequenced BGCs.
    • Public databases (NCBI, UniProt) using sensitive searches (JackHMMER, PSI-BLAST).
    • Output from tools like gecco or DeepBGC.
  • Multiple Sequence Alignment (MSA):
    • Use MAFFT (--auto mode) or ClustalOmega.
    • Command: mafft --auto input_sequences.fasta > alignment.aln
    • Manually inspect/trim alignment to the core conserved domain region.
  • HMM Construction:
    • Use the hmmbuild command from the HMMER suite.
    • Command: hmmbuild --amino custom_profile.hmm alignment.aln
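  • Indexing for hmmscan: (Required before the profile can be searched with hmmscan)
    • Command: hmmpress custom_profile.hmm (This creates the compressed, indexed binary files that hmmscan reads. HMMER3 profiles need no separate calibration step; hmmbuild already fits the score-distribution parameters.)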
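
The three commands in this protocol can be chained from a script so that a failure at any step stops the run. The sketch below assumes mafft, hmmbuild, and hmmpress are on PATH and reuses the file names from the commands above; the function names are illustrative.

```python
import shutil
import subprocess


def build_hmm_commands(fasta, alignment="alignment.aln", hmm="custom_profile.hmm"):
    """Return the three commands from the protocol as argument lists."""
    return {
        "mafft": ["mafft", "--auto", fasta],  # writes the alignment to stdout
        "hmmbuild": ["hmmbuild", "--amino", hmm, alignment],
        "hmmpress": ["hmmpress", hmm],
    }


def run_pipeline(fasta, alignment="alignment.aln", hmm="custom_profile.hmm"):
    """Run mafft -> hmmbuild -> hmmpress, stopping at the first failure."""
    commands = build_hmm_commands(fasta, alignment, hmm)
    for tool in commands:
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} not found on PATH")
    # mafft prints to stdout, so redirect it into the alignment file.
    with open(alignment, "w") as out:
        subprocess.run(commands["mafft"], stdout=out, check=True)
    subprocess.run(commands["hmmbuild"], check=True)
    subprocess.run(commands["hmmpress"], check=True)
```

Manual inspection and trimming of the alignment still belongs between the mafft and hmmbuild steps; for a curated profile, run the steps separately rather than in one pass.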

Protocol 3.2: Integrating Custom HMMs into BiG-SCAPE Analysis

Objective: Replace the default Pfam database with an enhanced version containing custom profiles.

  • Prepare the Custom Pfam Directory:
    • Copy the entire default Pfam directory used by BiG-SCAPE/antiSMASH (e.g., pfam).
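    • Append the contents of custom_profile.hmm to the Pfam-A.hmm file in this directory, then re-run hmmpress on the merged file (hmmpress -f overwrites the existing index files) so the new profile is searchable.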
    • Edit the Pfam-A.hmm.dat file: Append a new entry following the standard format:

  • Run BiG-SCAPE with the Custom Database:
    • Use the --pfam_dir flag to point to your modified directory.
    • Command:

  • Validation: Check the logs/ directory. The hmmsearch step should list your custom profile. The final network .network file will contain domain counts for your custom HMM.

Protocol 3.3: Defining a Novel BGC Class from Network Results

Objective: Establish rules for a novel BGC class based on the co-occurrence of custom and core Pfam domains.

  • Identify Candidate GCF: From the BiG-SCAPE network, isolate a GCF that is enriched for your custom HMM signature.
  • Domain Architecture Analysis: Extract all domains from member BGCs using the domains output file. Calculate frequency and co-occurrence.
  • Establish Class Rules: Define a logical predicate. Example rule for a novel hybrid class "NRPS-Terpene-X":
    • Rule: (NRPS_Core_Domain >= 2) AND (Terpene_synthase == 1) AND (Custom_HMM >= 1)
  • Implement Rule for Future Prediction: Script-based filtering of BiG-SCAPE domains files or integration into a tool like antiSMASH via a custom detection rule.
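
The example rule translates directly into a short filter. This is a minimal sketch, assuming the domains of each BGC have already been extracted into a list of names; NRPS_Core_Domain, Terpene_synthase, and Custom_HMM are the placeholder names from the rule above, not real Pfam identifiers.

```python
from collections import Counter


def matches_nrps_terpene_x(domains):
    """Apply the example 'NRPS-Terpene-X' rule to one BGC's domain list."""
    counts = Counter(domains)  # missing names count as zero
    return (
        counts["NRPS_Core_Domain"] >= 2
        and counts["Terpene_synthase"] == 1
        and counts["Custom_HMM"] >= 1
    )


# Hypothetical BGCs for illustration only.
bgcs = {
    "bgc_001": ["NRPS_Core_Domain", "NRPS_Core_Domain", "Terpene_synthase", "Custom_HMM"],
    "bgc_002": ["NRPS_Core_Domain", "Terpene_synthase", "Custom_HMM"],
}
selected = [name for name, domains in bgcs.items() if matches_nrps_terpene_x(domains)]
print(selected)
```

Keeping the rule as a single function makes it easy to adjust thresholds as more members of the candidate class are found.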

Data Presentation: Quantitative Analysis of Custom HMM Impact

Table 1: Impact of Custom NRPS Condensase HMM on GCF Partitioning

| Analysis Condition | Total GCFs | GCFs Containing Target | Max GCF Size | Avg. Domains/GCF | Novel Segregated Clusters |
|---|---|---|---|---|---|
| Default BiG-SCAPE | 142 | 15 | 45 | 18.7 | 0 (Baseline) |
| With Custom HMM | 156 | 23 | 38 | 16.2 | 8 |

Data from thesis Chapter 4: Analysis of 500 Actinobacterial genomes.

Table 2: Domain Composition of Novel BGC Class "NRPS-X"

| Pfam/HMM ID | Domain Name | Occurrence Frequency (%) in Novel Class (n=23) | Frequency in Background (%) |
|---|---|---|---|
| PF00501 | AMP-binding | 100 | 12.1 |
| PF00550 | PCP (PP-binding) | 100 | 10.8 |
| Custom_01 | X-Condensase | 100 | 0.5 |
| PF00975 | MTase | 78 | 4.3 |

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials and Tools for Custom HMM Integration

| Item/Category | Specific Product/Software | Function & Explanation |
|---|---|---|
| Sequence Database | NCBI NR, UniProt, MIBiG | Source for homologous sequences to build robust MSAs. |
| Alignment Tool | MAFFT v7.520, Clustal Omega | Generates accurate MSAs from divergent sequences. |
| HMM Engine | HMMER Suite (v3.4) | Core software for building (hmmbuild) and searching (hmmsearch) HMM profiles. |
| BGC Analysis Suite | BiG-SCAPE (v2.0), antiSMASH (v7.1) | Platform for running the analysis with custom databases. |
| Custom HMM Database | In-house curated .hmm files | The novel profiles defining new domain biology. |
| Scripting Environment | Python 3.10+ with Biopython, Pandas | For automating database merging, parsing results, and defining class rules. |
| Visualization | Cytoscape (v3.10), Graphviz | For analyzing and refining BiG-SCAPE network graphs. |

Logical Pathway for Novel Class Definition

The decision process from HMM integration to class definition follows this logic:

[Diagram: Logic for Novel BGC Class Definition. Custom HMM integrated? (No, failed: return to HMM refinement) → Does it define a new GCF? (No impact: return to HMM refinement) → Is domain architecture consistent & novel? (Architecture not unique: return to HMM refinement) → Establish formal class rules]

BiG-SCAPE vs. Other Tools: Benchmarking Performance and Defining Its Unique Niche

Application Notes

antiSMASH (antibiotics & Secondary Metabolite Analysis Shell) is the established standard for the in silico prediction and annotation of Biosynthetic Gene Clusters (BGCs). However, a significant post-prediction challenge remains: understanding the evolutionary and functional relationships between the thousands of BGCs identified across genomes. This is where the BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) pipeline provides its critical complementary role.

While antiSMASH excels at the detection of individual BGCs, BiG-SCAPE operates at the family level, grouping predicted BGCs into Gene Cluster Families (GCFs) based on conserved domain architecture and sequence similarity. This transition from a single cluster to a family-level view enables researchers to prioritize BGCs for discovery, infer phylogenetic relationships, and explore the genomic basis of chemical diversity. This protocol is framed within a thesis on BiG-SCAPE's role in structuring the global BGC landscape for natural product discovery and rational genome mining.

Quantitative Data Summary: antiSMASH vs. BiG-SCAPE Outputs

Table 1: Core Function Comparison

| Tool | Primary Function | Key Input | Key Output | Analysis Scope |
|---|---|---|---|---|
| antiSMASH | BGC Prediction & Annotation | Genome/Contig (FASTA) | Annotated BGCs (GenBank, JSON) | Single Genome |
| BiG-SCAPE | BGC Clustering & Networking | antiSMASH results (GenBank) | Gene Cluster Families (GCFs), Network Files | Multi-Genome, Pangenome |

Table 2: Typical BiG-SCAPE Run Metrics (Example Dataset)

| Parameter | Value |
|---|---|
| Input BGCs (from antiSMASH) | 1,250 |
| Processed Product Classes (e.g., NRPS, PKS, RiPPs) | 8 |
| Pairwise Comparisons Computed | ~780,000 |
| Final Gene Cluster Families (GCFs) Identified | 142 |
| Singleton BGCs (not grouped) | 68 |
| Average BGCs per GCF | 8.3 |
| Runtime (with --mix option) | 4.5 hours |

Protocol: From antiSMASH Prediction to BiG-SCAPE Family Analysis

Protocol 1: Generating Input Data with antiSMASH

  • Input Preparation: Collect genomic data in FASTA format (complete genomes or assembled contigs).
  • antiSMASH Execution: Run antiSMASH locally or via the web server.

  • Output Curation: For each analyzed genome, locate the GenBank file (.gbk) for each predicted BGC within the antiSMASH output directory. These files contain the essential sequence and domain annotation data required by BiG-SCAPE.

Protocol 2: Clustering BGCs into Families with BiG-SCAPE

  • Environment Setup: Install BiG-SCAPE and its dependencies (HMMER3, FastTree, mafft) via Conda.

  • Input Organization: Place all antiSMASH-derived GenBank files (.gbk) into a single directory (e.g., gbks/). Subdirectories are allowed.
  • Running BiG-SCAPE: Execute the core clustering and analysis.

  • Output Interpretation: Navigate to the bigscape_output/network_files/ directory. The file Network_Annotations_Full.tsv lists per-BGC annotations (such as product class and organism), while the per-class clustering .tsv files record the BGC-to-GCF assignments for each cutoff. Import the .network files into Cytoscape to visualize the results, where nodes are BGCs and edges connect pairs that pass the defined cutoff. Clusters of interconnected nodes represent GCFs.
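
BGC-to-GCF assignments can be summarized with a few lines of Python once they are available as a tab-separated table. The sketch below makes two assumptions that should be checked against your own output files: that the table has one row per BGC, and that the columns are headed "#BGC Name" and "Family Number". Both header names are parameters so they can be changed.

```python
import csv
import io
from collections import defaultdict


def gcf_membership(tsv_text, bgc_column="#BGC Name", family_column="Family Number"):
    """Map each family identifier to the list of BGCs assigned to it."""
    families = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        families[row[family_column]].append(row[bgc_column])
    return dict(families)


def summarize(families):
    """Return (families with >1 member, singleton families, mean size of the former)."""
    multi = [members for members in families.values() if len(members) > 1]
    singletons = len(families) - len(multi)
    mean_size = sum(map(len, multi)) / len(multi) if multi else 0.0
    return len(multi), singletons, mean_size
```

Reading a file is then a matter of passing its contents, for example gcf_membership(open(path).read()), where path points at the clustering table for the cutoff of interest.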

Workflow Diagram

[Diagram: Genomic Data (FASTA Files) → 1. Input → antiSMASH Prediction & Annotation → 2. Extract → Individual BGCs (GenBank Files) → 3. Collect → BiG-SCAPE Clustering & Networking → 4. Process → Gene Cluster Families (GCFs & Networks) → 5. Interpret → Prioritization & Phylogenetic Analysis]

Title: From Genomes to Gene Cluster Families

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Resources

| Item | Function & Relevance |
|---|---|
| antiSMASH DB | Precomputed BGC predictions for public genomes; provides rapid access or validation data. |
| MIBiG Database | Reference repository of known BGCs; crucial for annotating and grounding novel GCFs. |
| Cytoscape | Open-source platform for visualizing and interacting with BiG-SCAPE's network output files. |
| Pfam & dbCAN2 HMMs | Hidden Markov Model databases used by both antiSMASH (detection) and BiG-SCAPE (domain alignment). |
| Conda/Bioconda | Package manager for creating reproducible, contained software environments for these tools. |
| Jupyter Notebooks | Facilitates interactive data analysis, visualization, and documentation of the workflow. |

Analysis Pathway Diagram

[Diagram: Gene Cluster Family (GCF) → Prioritization Criteria? → Novel Structure: Heterologous Expression (Discovery); Unique Phylogeny: Evolutionary Analysis (Mechanism); Known Product: Genomic Dereplication]

Title: Downstream Analysis Pathways for a GCF

1. Introduction & Thesis Context

Within the broader thesis research on leveraging BiG-SCAPE for genome mining and gene cluster family (GCF) analysis, a critical evaluation of the available computational ecosystem is required. This analysis positions BiG-SCAPE relative to key alternative and complementary tools (notably PRISM, ClustScan, antiSMASH, and ARTS) to delineate their respective niches, strengths, and limitations in natural product discovery pipelines.

2. Tool Overview and Quantitative Comparison

Table 1: Core Feature Comparison of Major GCF Analysis Tools

Feature / Tool | BiG-SCAPE | PRISM 4 | ClustScan (CLUSEAN) | antiSMASH | ARTS 2.0
Primary Purpose | GCF network analysis & classification | Combinatorial structure prediction & mapping | Rule-based cluster detection/annotation | Comprehensive cluster detection & annotation | Resistance gene-guided cluster prioritization
Input | antiSMASH GBK files, GenBank files | GenBank files, nucleotide sequences | GenBank files | Genome sequences/contigs, GenBank files | GenBank files, genome assemblies
Core Algorithm | Pairwise distance (Jaccard index, adjacency index, DSS), affinity propagation clustering | HMM-based domain detection, rule-based combinatorial structure generation | HMM-based domain detection, rule-based logic | HMM-based detection of core biosynthetic genes | HMM-based detection of resistance genes, plus duplication and phylogeny screens of core genes
Key Output | Network files (.network), GCF classifications | Predicted chemical structures, modified peptides | Annotated modules and clusters | Annotated cluster regions with borders | Cluster regions ranked by resistance gene evidence
Visualization | Cytoscape-compatible networks | Chemical structure diagrams, linear maps | Linear cluster maps | Interactive HTML page with detailed maps | HTML report with highlighted regions
Integration | Downstream of antiSMASH | Standalone, can use antiSMASH input | Standalone pipeline | Upstream of BiG-SCAPE, CORASON | Integrates with antiSMASH
Quantitative Metric | ~80-95% accuracy in GCF grouping* | >70% structure prediction precision for known classes* | ~85% domain annotation accuracy* | >90% sensitivity for major BGC classes* | >5-fold enrichment in active strains*
Strengths | Exceptional at large-scale phylogeny & relationship mapping | Unique structure prediction, retrobiosynthesis | Detailed module-level annotation, rule-based | Industry standard for detection, user-friendly | Excellent for prioritization & novelty filtering
Weaknesses | No chemical prediction, requires pre-called clusters | Computationally intensive, less accurate for novel folds | Less updated, lower detection sensitivity | GCF analysis requires external tools (BiG-SCAPE) | Narrow focus on resistance-based prioritization

*Representative published or benchmarked performance estimates.
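
To make the "Jaccard index" entry in the Core Algorithm row concrete, the sketch below computes it for the Pfam domain sets of two BGCs. It illustrates only that one component; BiG-SCAPE combines it with other indices, and the domain lists here are hypothetical examples rather than real cluster annotations.

```python
def jaccard_index(domains_a, domains_b):
    """Jaccard index of two collections of Pfam domain accessions."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty clusters share nothing by convention here
    return len(set_a & set_b) / len(union)


# Hypothetical domain content of two BGCs (illustrative accessions only)
bgc_1 = ["PF00109", "PF02801", "PF00698", "PF00550"]
bgc_2 = ["PF00109", "PF02801", "PF00550", "PF00975"]

similarity = jaccard_index(bgc_1, bgc_2)  # 3 shared domains out of 5 distinct
distance = 1.0 - similarity
print(f"Jaccard similarity: {similarity:.2f}, distance: {distance:.2f}")
```

Because the index works on sets, repeated copies of a domain do not change the result, which is one reason a sequence-level term is needed alongside it.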

3. Detailed Application Notes and Protocols

3.1 Protocol A: Standard BiG-SCAPE Workflow for GCF Census

Objective: Generate a global GCF network from a set of microbial genomes.

  • Input Preparation: Run antiSMASH (v7+) on all target genomes using default parameters. Collect all GenBank (.gbk) output files into a single directory (./antismash_results/).
  • BiG-SCAPE Execution:

  • Output Analysis: Navigate to ./bigscape_output/network_files. Visualize the mix_original_c0.30.network file in Cytoscape. Use the *_clustering_c0.30.tsv file to obtain GCF membership for each BGC.
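
The Output Analysis step can be sketched as follows: a small reader that turns a clustering TSV into a mapping from GCF to member BGCs. The two-column layout and the "#BGC Name / Family Number" header are assumptions about the file; check them against your own output before relying on this.

```python
import csv
import io
from collections import defaultdict


def read_gcf_membership(handle):
    """Return {family_id: [bgc_name, ...]} from a two-column clustering TSV."""
    families = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, and header/comment lines
        bgc_name, family_id = row[0], row[1]
        families[family_id].append(bgc_name)
    return dict(families)


# Hypothetical file content standing in for *_clustering_c0.30.tsv
example = io.StringIO(
    "#BGC Name\tFamily Number\n"
    "genomeA.region001\t12\n"
    "genomeB.region004\t12\n"
    "genomeC.region002\t57\n"
)
membership = read_gcf_membership(example)
print(membership)
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")` on the clustering file.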

3.2 Protocol B: Integrated Prioritization using BiG-SCAPE and ARTS

Objective: Identify GCFs with high novelty and self-resistance potential.

  • Run antiSMASH with ARTS: Execute antiSMASH with the ARTS plugin enabled to pre-filter clusters.

  • Filter for High-Score Clusters: From the ARTS output, select BGCs with a total ARTS score > 5 (threshold adjustable). Extract their corresponding GenBank files.
  • Focused BiG-SCAPE Analysis: Run BiG-SCAPE using only the high-scoring ARTS BGCs and the MIBiG database as input.

  • Triangulate: Identify GCFs that are both distant from MIBiG references (in BiG-SCAPE network) and contain strong ARTS resistance evidence.
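
The triangulation step amounts to a two-condition filter. The sketch below assumes each GCF has already been summarized by its smallest distance to any MIBiG reference and its highest ARTS score; the record type, field names, and the 0.7 distance threshold are hypothetical, while the score threshold of 5 follows the protocol text.

```python
from dataclasses import dataclass


@dataclass
class GcfSummary:
    gcf_id: str
    min_distance_to_mibig: float  # smallest distance from any member to a MIBiG BGC
    max_arts_score: float         # strongest resistance evidence among members


def triangulate(gcfs, min_distance=0.7, min_arts_score=5.0):
    """Keep GCFs that are far from MIBiG and carry strong ARTS evidence."""
    picked = [
        g for g in gcfs
        if g.min_distance_to_mibig >= min_distance and g.max_arts_score > min_arts_score
    ]
    # Rank by resistance evidence first, then by distance from known clusters
    return sorted(picked, key=lambda g: (-g.max_arts_score, -g.min_distance_to_mibig))


# Hypothetical summaries
gcfs = [
    GcfSummary("GCF_1", 0.85, 7.0),
    GcfSummary("GCF_2", 0.20, 9.0),  # close to a known cluster
    GcfSummary("GCF_3", 0.90, 3.0),  # weak resistance evidence
    GcfSummary("GCF_4", 0.75, 8.5),
]
shortlist = triangulate(gcfs)
print([g.gcf_id for g in shortlist])
```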

3.3 Protocol C: Comparative Annotation with PRISM and ClustScan

Objective: Gain complementary chemical and module-level insights for a specific GCF of interest.

  • Isolate BGC Representatives: Select 3-5 representative GenBank files for a single GCF identified by BiG-SCAPE.
  • PRISM Structure Prediction:

  • ClustScan Annotation:

  • Synthesize Findings: Compare PRISM's predicted chemical scaffold with ClustScan's enzymatic module analysis to hypothesize biosynthetic logic and potential final product diversity.

4. Visualization Diagrams

[Diagram] Genomes -> AntiSMASH (Sequence); AntiSMASH -> GBK_Files (Detects BGCs); GBK_Files -> BiG_SCAPE (Primary Input); GBK_Files -> PRISM, ClustScan, ARTS (Input); BiG_SCAPE -> Networks (GCF Networks); PRISM -> Structures (Predicted Molecules); ClustScan -> Annotations (Module Maps); ARTS -> AntiSMASH (Plugin); ARTS -> Prioritized (Scored BGCs); Prioritized -> BiG_SCAPE (Focused Analysis)

Title: Ecosystem of GCF Analysis Tools and Data Flow

[Diagram] Genome Assemblies -> antiSMASH v7+ (Cluster Detection) -> GBK Files -> BiG-SCAPE (Distance Calculation) -> Pairwise Distance Matrix -> Affinity Propagation Clustering -> Gene Cluster Families (GCFs) -> Cytoscape Network Visualization -> Hypothesis: Novel GCF / Phylogeny

Title: BiG-SCAPE Core Protocol Workflow

5. The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Computational Reagents for GCF Analysis

Item / Solution | Function / Purpose | Example / Note
antiSMASH Database | Provides HMM profiles for BGC core gene detection. | Essential upstream dependency for BiG-SCAPE. Regularly updated.
MIBiG Reference Dataset | Gold-standard repository of known BGCs for comparison and distance calibration. | Integrated into BiG-SCAPE via the --mibig flag, which adds reference BGCs as nodes in the networks.
Pfam & CDD Profiles | Protein domain databases for functional annotation of BGC genes. | Used by ClustScan and antiSMASH for in-depth annotation.
Cytoscape Software | Open-source platform for visualizing and exploring complex networks. | Required for interactive analysis of BiG-SCAPE's .network files.
Prodigal | Gene-finding software for annotating open reading frames (ORFs). | Often used as the default gene caller in antiSMASH pipelines.
HMMER Suite | Toolkit for profile Hidden Markov Model searches. | The core algorithm behind domain detection in most tools.
BiG-SCAPE Cutoff Parameters | Adjustable thresholds (e.g., 0.3, 0.7) for defining GCF relatedness. | Critical "reagent" for tuning network granularity and GCF size.
ARTS Resistance Gene HMMs | Custom profiles for detecting putative self-resistance elements. | The core filtering mechanism of the ARTS prioritization tool.

Application Notes

Within the context of a broader thesis on BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) for gene cluster family (GCF) analysis, a critical validation step involves correlating the computationally derived GCF networks with experimentally characterized biosynthetic pathways. This protocol details the methodology for leveraging the MIBiG (Minimum Information about a Biosynthetic Gene cluster) database to annotate and validate BiG-SCAPE network nodes, thereby confirming the chemical and functional coherence of GCFs and prioritizing novel clusters for discovery.

The core principle involves cross-referencing the protein sequences of each gene cluster within a BiG-SCAPE network (node) against the MIBiG repository. A successful correlation confirms that the GCF contains pathways known to produce specific natural products, validating the network's biological relevance. This process transforms a network of sequence similarity into a map of chemical potential, distinguishing known from novel chemical space.

Key Quantitative Correlation Metrics

Table 1: Metrics for Evaluating GCF-MIBiG Correlation

Metric | Description | Interpretation
MIBiG Hit Percentage | (Number of nodes with a BLAST hit to MIBiG / Total nodes in GCF) x 100 | High percentage indicates a well-characterized GCF. Low percentage suggests novelty.
Average Amino Acid Identity (AAI) | Mean AAI from BLASTP of node vs. MIBiG reference cluster. | AAI > ~70% often indicates production of the same or highly similar compound.
MIBiG Diversity Index | Number of unique MIBiG BGC classes (e.g., NRPS, PKS I, RiPP) represented in a GCF's hits. | Low diversity suggests a chemically coherent GCF. High diversity may indicate over-clustering.
Core Biosynthetic Protein Coverage | Percentage of core biosynthetic enzymes in the MIBiG reference that have a significant hit within the query node. | High coverage (>80%) strengthens confidence in the functional prediction.
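
Three of the metrics in Table 1 reduce to simple set and ratio arithmetic. The sketch below shows one way to compute them; the function names and the input shapes (a node-to-hit flag mapping, a list of class labels, collections of protein identifiers) are illustrative assumptions, not part of any tool's output format.

```python
def mibig_hit_percentage(node_has_hit):
    """Percentage of nodes in a GCF with at least one MIBiG hit."""
    if not node_has_hit:
        return 0.0
    return 100.0 * sum(node_has_hit.values()) / len(node_has_hit)


def mibig_diversity_index(hit_classes):
    """Number of distinct MIBiG BGC classes among a GCF's hits."""
    return len(set(hit_classes))


def core_coverage(reference_core, matched_core):
    """Percentage of reference core enzymes that have a significant hit."""
    reference = set(reference_core)
    if not reference:
        return 0.0
    return 100.0 * len(reference & set(matched_core)) / len(reference)


# Hypothetical GCF with four nodes
hit_pct = mibig_hit_percentage({"bgc_a": True, "bgc_b": True, "bgc_c": False, "bgc_d": True})
diversity = mibig_diversity_index(["NRPS", "NRPS", "PKS I"])
coverage = core_coverage({"A", "B", "C", "D", "E"}, {"A", "B", "C", "D"})
print(hit_pct, diversity, coverage)
```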

Protocol: Correlating BiG-SCAPE GCF Networks with MIBiG

I. Prerequisites and Input Data

  • BiG-SCAPE Output: The network file (network.tsv) and the corresponding GenBank files for all clusters in the analysis.
  • MIBiG Database: Download the latest MIBiG dataset (JSON format and/or protein FASTA files) from https://mibig.secondarymetabolites.org/.
  • Software: BiG-SCAPE, antismash-json tool (from antiSMASH), gbk-to-faa.py (or similar), BLAST+ suite, and a scripting environment (Python/R).

II. Step-by-Step Protocol

Step 1: Extract Protein Sequences from BiG-SCAPE Nodes

For each GenBank file corresponding to a node in your BiG-SCAPE network, extract all protein sequences.

Concatenate all extracted protein sequences into a single query FASTA file.

Step 2: Prepare the MIBiG Reference Database

Convert the MIBiG JSON entries into a non-redundant protein sequence database.

Step 3: Perform Homology Search

Execute a BLASTP search of your query proteins against the MIBiG database.
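
Steps 2 and 3 can be scripted by assembling the BLAST+ command lines in Python. This is a minimal sketch: the FASTA and database names are hypothetical placeholders, the e-value mirrors the threshold used in Step 4, and the `run` helper is defined but not called so the block works without BLAST+ installed.

```python
import subprocess


def makeblastdb_command(protein_fasta, db_prefix):
    """argv for building a protein BLAST database with makeblastdb."""
    return ["makeblastdb", "-in", protein_fasta, "-dbtype", "prot", "-out", db_prefix]


def blastp_command(query_fasta, db_prefix, out_tsv, evalue="1e-10", threads=4):
    """argv for a BLASTP search written in tabular format (outfmt 6)."""
    return [
        "blastp", "-query", query_fasta, "-db", db_prefix, "-out", out_tsv,
        "-outfmt", "6", "-evalue", evalue, "-num_threads", str(threads),
    ]


def run(command):
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


# Hypothetical input names; blast_results.tsv is the file parsed in Step 4
build_cmd = makeblastdb_command("mibig_proteins.faa", "mibig_db")
search_cmd = blastp_command("query_proteins.faa", "mibig_db", "blast_results.tsv")
print(" ".join(build_cmd))
print(" ".join(search_cmd))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file paths contain spaces.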

Step 4: Parse Results and Map to GCF Networks

Using a custom script (Python example logic):

  • Parse blast_results.tsv.
  • Map each qseqid (query protein ID) back to its source GenBank file (node).
  • Map each sseqid (subject MIBiG protein ID) to its MIBiG BGC entry and compound information.
  • Aggregate hits at the node level: A node is considered a "MIBiG hit" if at least one of its proteins has a significant BLAST match (e.g., evalue < 1e-10, percent identity > 30%).
  • Annotate the BiG-SCAPE network file (network.tsv) by adding a column, e.g., MIBiG_Compound, containing the known product name for hit nodes.
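
The parsing and aggregation logic listed above might look like the sketch below. It assumes the standard 12-column BLAST tabular layout, and it assumes sequence identifiers were written as "node|protein" for queries and "MIBiG accession|protein" for subjects; that naming scheme is a hypothetical convention you would have to apply when building the FASTA files.

```python
import csv
import io


def node_level_hits(handle, max_evalue=1e-10, min_identity=30.0):
    """Return {node: set of MIBiG accessions} for hits passing both filters."""
    hits = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12:
            continue  # not a complete outfmt 6 record
        qseqid, sseqid = row[0], row[1]
        pident, evalue = float(row[2]), float(row[10])
        if evalue < max_evalue and pident > min_identity:
            node = qseqid.split("|", 1)[0]
            accession = sseqid.split("|", 1)[0]
            hits.setdefault(node, set()).add(accession)
    return hits


# Hypothetical rows standing in for blast_results.tsv
example = io.StringIO(
    "bgc_a|orf1\tBGC0000001|p1\t72.5\t300\t80\t2\t1\t300\t1\t300\t1e-120\t400\n"
    "bgc_a|orf2\tBGC0000002|p4\t25.0\t150\t110\t3\t1\t150\t1\t150\t1e-12\t60\n"
    "bgc_b|orf1\tBGC0000001|p2\t45.0\t200\t105\t5\t1\t200\t1\t200\t1e-5\t50\n"
)
hits = node_level_hits(example)
print(hits)
```

Here only the first row passes both filters, so `bgc_a` is a MIBiG hit and `bgc_b` is not. Nodes absent from the returned mapping are the "No Hit" candidates.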

Step 5: Analyze and Visualize Correlations

  • Calculate the metrics in Table 1 for each GCF.
  • Generate visualizations: Color nodes in the BiG-SCAPE network graph by their MIBiG-derived compound class or novelty status (Hit/No Hit).

Visualization: GCF-MIBiG Validation Workflow

[Diagram] Input GenBank Files (BiG-SCAPE Nodes) -> 1. Extract Protein Sequences -> Query Protein FASTA -> 3. Run BLASTP Homology Search; MIBiG JSON Database -> 2. Build MIBiG BLAST DB -> MIBiG Protein Database -> 3. Run BLASTP Homology Search; 3. Run BLASTP Homology Search -> BLAST Results (TSV Format) -> 4. Parse & Map Results to Network -> Annotated Network + Metrics Table

Diagram 1: Workflow for correlating GCF networks with MIBiG.

Research Reagent & Tool Solutions

Table 2: Essential Toolkit for GCF-MIBiG Validation

Item | Function in Protocol | Source/Example
BiG-SCAPE (v2.0+) | Generates the core GCF network from input BGCs. | GitHub Repository
MIBiG Dataset (v3.0+) | Gold-standard repository of experimentally characterized BGCs for correlation. | MIBiG Website
BLAST+ Suite (v2.13+) | Performs the essential homology search between query proteins and MIBiG references. | NCBI
antiSMASH-json Tool | Utility to extract protein sequences or other data from antiSMASH/MIBiG JSON files. | Bundled with antiSMASH
Custom Python/R Scripts | For parsing BLAST outputs, mapping hits to network nodes, and calculating summary metrics. | Researcher-developed
Cytoscape/ggnetwork | Visualization platforms to render the final annotated BiG-SCAPE network, coloring nodes by MIBiG annotation. | Open-source software

This document presents detailed Application Notes and Protocols within the context of a broader thesis investigating the use of BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) for gene cluster family (GCF) analysis. The primary objective is to establish a robust workflow for the discovery of novel natural product analogs and the systematic prioritization of Biosynthetic Gene Clusters (BGCs) for subsequent heterologous expression, a critical bottleneck in natural product-based drug discovery.

Application Notes: An Integrated Workflow

Core Quantitative Findings from BiG-SCAPE Analysis

The following table summarizes key metrics from a representative BiG-SCAPE analysis of 1,245 bacterial genomes, highlighting the clustering efficiency and novelty detection potential.

Table 1: BiG-SCAPE GCF Analysis Summary (Representative Dataset)

Metric | Value | Interpretation
Total BGCs Input | 4,832 | BGCs predicted by antiSMASH.
Gene Cluster Families (GCFs) Formed | 347 | Groups of related BGCs.
Singleton BGCs | 112 | BGCs with no close relatives; high novelty potential.
Average BGCs per GCF (singletons excluded) | 13.6 | Indicates common structural themes.
Largest GCF (Type I PKS) | 89 members | A widely distributed cluster type.
GCFs with >50% Unknown Core Biosynthetic Genes | 23 | High priority for novel chemistry.
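
The summary values in Table 1 can be derived from a GCF membership mapping, as sketched below with toy data. The last lines check the table's own arithmetic: the reported average of 13.6 is reproduced when the 112 singletons are left out of the numerator, which suggests (as an assumption, since the table does not say so) that the 347 GCFs do not count singletons.

```python
def summarize(membership):
    """Summarize {family_id: [bgc, ...]} into census-style metrics."""
    sizes = [len(members) for members in membership.values()]
    multi = [size for size in sizes if size > 1]  # families with 2+ members
    return {
        "total_bgcs": sum(sizes),
        "gcfs": len(multi),
        "singletons": sizes.count(1),
        "avg_bgcs_per_gcf": sum(multi) / len(multi) if multi else 0.0,
    }


# Toy membership: two multi-member families and one singleton
toy = {"1": ["a", "b", "c"], "2": ["d", "e"], "3": ["f"]}
summary = summarize(toy)
print(summary)

# Consistency check against the values reported in Table 1
avg_excluding_singletons = (4832 - 112) / 347
avg_including_singletons = 4832 / 347
print(round(avg_excluding_singletons, 1), round(avg_including_singletons, 1))
```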

Table 2: Prioritization Scoring Matrix for Heterologous Expression

Priority Tier | Scoring Criteria (Example) | Target BGCs
Tier 1 (Highest) | Singleton with a high % of unknown genes, detected in a difficult-to-culture phylum. | 15
Tier 2 (High) | Member of a small GCF (2-5 members) and phylogenetically distant from known producers. | 42
Tier 3 (Medium) | Core structure predicted to be a novel variant of a pharmaceutically relevant scaffold (e.g., tetracycline). | 108
Tier 4 (Lower) | BGC shows high similarity (>80%) to well-characterized clusters. | Remaining
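
The scoring matrix can be expressed as a small decision function. This is a sketch under stated assumptions: the parameter names are hypothetical, "high % unknown genes" is interpreted as more than 50% (borrowing the threshold from Table 1), and anything that meets no higher criterion falls through to Tier 4.

```python
def assign_tier(is_singleton, unknown_gene_fraction, hard_to_culture_phylum,
                gcf_size, distant_from_known_producers, novel_scaffold_variant):
    """Return the priority tier (1 = highest) for one BGC."""
    if is_singleton and unknown_gene_fraction > 0.5 and hard_to_culture_phylum:
        return 1
    if 2 <= gcf_size <= 5 and distant_from_known_producers:
        return 2
    if novel_scaffold_variant:
        return 3
    return 4


# Hypothetical candidates
tier_a = assign_tier(True, 0.65, True, 1, True, False)
tier_b = assign_tier(False, 0.20, False, 3, True, False)
tier_c = assign_tier(False, 0.10, False, 40, False, True)
tier_d = assign_tier(False, 0.05, False, 40, False, False)
print(tier_a, tier_b, tier_c, tier_d)
```

Checking the tiers in order means a BGC that satisfies several criteria is always assigned the highest tier it qualifies for.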

Key Research Reagent Solutions & Materials

Table 3: Essential Toolkit for BGC Heterologous Expression

Item | Function & Explanation
pCAP01 / pJWV25 Vectors | Shuttle vectors for capture and expression of large BGCs in E. coli and Streptomyces.
λ-RED/ET Recombination Kit | Enables seamless, recombination-based cloning of large BGC fragments directly from genomic DNA.
Methylation-Free E. coli Strain (e.g., ET12567/pUZ8002) | Essential for conjugal transfer of cloned BGCs from E. coli to actinobacterial hosts.
Optimized Streptomyces Host (e.g., S. coelicolor M1152/M1146) | Engineered heterologous hosts with reduced native secondary metabolism and improved precursor supply.
CAS Agar Plates | Chrome Azurol S assay for rapid detection of siderophore production, a proxy for successful expression.
ISP2/R2YE Media | Rich sporulation and production media for Streptomyces cultivation and metabolite extraction.
HPLC-MS with PDA/ELSD Detectors | For chemical analysis of culture extracts to detect novel compounds post-expression.

Detailed Experimental Protocols

Protocol: BiG-SCAPE Analysis for Novel Analog Discovery

Objective: To cluster BGCs into GCFs and identify candidates producing novel analogs.

  • Input Preparation: Run antiSMASH (v7.0+) on your genomic/metagenomic assemblies. Collect the resulting GenBank (.gbk) files in a single input directory; BiG-SCAPE reads antiSMASH GenBank output directly, so no format conversion is needed.
  • Run BiG-SCAPE: Execute the core command: python bigscape.py -i ./input_dir -o ./output_dir --cutoffs 0.3 0.7 --mibig --mix.
    • --cutoffs: Distance cutoffs used to build the networks; lower values are stricter (0.3 links only closely related BGCs, 0.7 is more permissive).
    • --mibig: Includes MIBiG reference clusters for annotation.
    • --mix: Considers all BGC types together.
  • Analyze Network: Open the index.html file in the ./output_dir. Visually inspect the sequence similarity network. Clusters of nodes (BGCs) represent GCFs.
  • Identify Novel Analogs: Focus on:
    • Singletons: Isolated nodes.
    • GCF Edges: GCFs connected by thin lines (low similarity) to known MIBiG clusters (diamond-shaped nodes), suggesting structural analogs.
  • Data Extraction: Use the ./output_dir/Network_Annotations_Full.tsv file. Filter for BGCs where the Known Cluster column is "N/A" or where similarity to known clusters (Similar MIBiG BGCs column) is below your chosen threshold (e.g., <30%).
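
To show how the choice of distance cutoff changes the resulting groups and singletons, the sketch below links BGCs whose pairwise distance is at or below a cutoff and reports the connected groups. This is a deliberate simplification for illustration: it uses plain connected components rather than BiG-SCAPE's own family-calling procedure, and the distances are hypothetical.

```python
def families_at_cutoff(nodes, edges, cutoff):
    """Group nodes linked by edges with distance <= cutoff (union-find)."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b, dist in edges:
        if dist <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairwise distances between four BGCs
nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 0.2), ("B", "C", 0.5), ("C", "D", 0.9)]

strict = families_at_cutoff(nodes, edges, 0.3)
permissive = families_at_cutoff(nodes, edges, 0.7)
print(strict)
print(permissive)
```

At 0.3 only A and B are grouped and both C and D remain singletons; at 0.7 C joins the first group and only D stays isolated. Singleton counts should therefore always be reported together with the cutoff used.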

Protocol: TAR/BAC-Assisted BGC Capture and Heterologous Expression

Objective: To clone a prioritized ~60 kb Type II PKS BGC into a heterologous host.

  • Design Capture Vector:

    • Using the BGC sequence, design ~500 bp homology arms (HA) targeting regions ~2 kb upstream and downstream of the cluster boundaries.
    • Amplify HAs via PCR and clone into a linearized BAC vector (e.g., pIndigoBAC-536) containing an origin of transfer (oriT) and an apramycin resistance marker, using Gibson Assembly.
  • Transformation and Conjugation:

    • Transform the engineered BAC into an E. coli donor strain (e.g., ET12567/pUZ8002) carrying the conjugation machinery.
    • Prepare spores of the expression host Streptomyces albus J1074.
    • Mix donor E. coli and S. albus spores on an MS agar plate and incubate at 30°C for 16-20 hours.
  • Exconjugant Selection:

    • Overlay the conjugation mixture with 1 mL water containing 1 mg nalidixic acid (to counter-select E. coli) and 1 mg apramycin (to select for BAC integration).
    • Incubate at 30°C for 5-7 days until exconjugant colonies appear.
  • Metabolite Production and Analysis:

    • Inoculate exconjugants into TSB liquid medium with apramycin. Incubate at 30°C, 220 rpm for 2 days as seed culture.
    • Transfer seed culture to production medium (e.g., SFM or R2YE). Incubate for 5-7 days.
    • Extract metabolites by adding equal volume of ethyl acetate to culture broth, vortex, and centrifuge. Dry the organic layer under vacuum.
    • Resuspend in methanol and analyze by LC-MS. Compare chromatograms to the wild-type and empty host controls to identify new peaks specific to the cloned BGC.

Visualizations

[Diagram] Genomic/Metagenomic Data -> antiSMASH BGC Prediction -> BiG-SCAPE Clustering into GCFs -> Novelty & Priority Analysis -> (Scoring Matrix) BGC Selection for Heterologous Expression -> Capture & Cloning (TAR/BAC, λ-RED) -> Conjugal Transfer to Expression Host -> Fermentation & Metabolite Extraction -> LC-MS Analysis & Novel Compound ID

Title: BGC Discovery to Expression Workflow

[Diagram] Prioritization Criteria (Singleton (BiG-SCAPE); % Unknown Biosynthetic Genes; Phylogenetic Novelty of Host; GCF Size & Diversity; Predicted Bioactive Scaffold) -> Priority Score & Tier Assignment -> Decision: Proceed to Cloning & Expression

Title: BGC Prioritization Logic Diagram

BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) is a pivotal tool for the automated analysis of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). Within the broader thesis of advancing GCF research for natural product discovery and drug development, a critical understanding of its limitations is essential for appropriate application and data interpretation.

Core Computational and Analytical Limitations

The following table summarizes key quantitative and qualitative constraints identified from current literature and software documentation.

Table 1: Defined Boundaries of BiG-SCAPE Functionality

Limitation Category | Specific Constraint | Impact on Analysis
Input Dependency | Relies entirely on BGC predictions from tools like antiSMASH. Errors or inconsistencies in input BGC delimitation propagate directly. | GCF boundaries can be artificially split or merged based on flawed input gene calls or cluster borders.
BGC Detection | Cannot de novo predict or identify BGCs from raw genomic data. It is strictly a post-prediction analysis tool. | Requires prior, often computationally intensive, genome mining with dedicated prediction software.
Sequence Analysis Scope | Operates on protein domains (Pfam) for similarity networking. Does not perform full-length protein or nucleotide sequence alignment. | May overlook significant homology or rearrangements in regions outside core conserved domains.
Chemical Inference | Cannot predict the chemical structure of the natural product encoded by a BGC or GCF. | Links sequence to potential chemistry only indirectly through known domain functions (e.g., adenylation domain specificity).
Regulatory & Context | Ignores regulatory genes and genetic context outside the defined BGC (e.g., host transcription factors, regulatory networks). | Provides no insight into BGC expression conditions or activation potential.
Evolutionary Modeling | Does not construct detailed phylogenetic trees or infer horizontal gene transfer events for GCFs. | Clustering is based on direct similarity, not evolutionary history or lineage.
Quantitative Thresholds | The distance cutoff (set with --cutoffs; 0.3 by default in BiG-SCAPE 1.x) is user-adjustable and somewhat arbitrary; changing it alters GCF composition. | GCFs are not absolute biological entities but computational groupings sensitive to parameters.

Experimental Protocol: Validating BiG-SCAPE GCF Predictions

This protocol outlines steps to experimentally probe the biosynthetic potential of a novel GCF identified by BiG-SCAPE, addressing its inability to predict chemical output.

Protocol Title: Heterologous Expression and Metabolite Profiling for a Novel GCF

Objective: To confirm the biosynthetic activity and characterize the chemical product of a BiG-SCAPE-defined GCF that lacks homology to known clusters.

Materials & Reagents:

  • Bacterial Artificial Chromosome (BAC) library constructed from environmental DNA (eDNA) of the source organism(s).
  • E. coli or Streptomyces spp. heterologous expression host (e.g., E. coli BAP1, S. albus J1074).
  • PCR Reagents and GCF-specific primers designed to conserved biosynthetic genes within the GCF.
  • Culture Media: ISP2, R5A, or other appropriate media for expression host growth and secondary metabolism induction.
  • Liquid Chromatography-Mass Spectrometry (LC-MS) system (e.g., UPLC-QTOF).
  • Solvents: HPLC-grade methanol, acetonitrile, ethyl acetate for extraction.

Procedure:

  • GCF-Targeted Screening: Screen the eDNA-BAC library using GCF-specific PCR primers to identify clones carrying the target BGC.
  • Clone Isolation & Sequencing: Isolate positive BAC clones and perform full-length sequencing to confirm the integrity of the inserted BGC.
  • Heterologous Expression: Introduce the purified BAC DNA into the chosen expression host via electroporation or conjugation. Include empty-vector control strains.
  • Cultivation & Induction: Inoculate transgenic and control strains in triplicate into suitable media. Incubate with shaking (e.g., 28°C, 200 rpm for 5-7 days). Consider using chemical elicitors (e.g., N-acetylglucosamine) to potentially induce silent clusters.
  • Metabolite Extraction: Harvest culture broth by centrifugation. Separate supernatant and cell pellet. Extract metabolites from the supernatant using equal volume ethyl acetate (3x). Extract the cell pellet with 70% acetone. Combine and evaporate organic extracts to dryness.
  • LC-MS Analysis: Resuspend extracts in methanol. Analyze by LC-MS alongside controls. Use diode-array detection (DAD) and high-resolution MS (HRMS) for metabolite profiling.
  • Data Analysis: Compare chromatograms (base peak intensity, UV traces) of expressing clones versus controls. Identify unique peaks present only in the expression clone. Use HRMS data to propose molecular formulas for novel compounds.
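
The comparison in the Data Analysis step can be sketched as a tolerance-based match between feature lists. The example assumes peaks have already been picked and exported as (m/z, retention time in minutes) pairs; the tolerances and feature values are hypothetical and should be tuned to the instrument.

```python
def unique_features(sample, control, mz_tol=0.01, rt_tol=0.2):
    """Return sample features with no control feature within both tolerances."""
    return [
        (mz, rt)
        for mz, rt in sample
        if not any(
            abs(mz - c_mz) <= mz_tol and abs(rt - c_rt) <= rt_tol
            for c_mz, c_rt in control
        )
    ]


# Hypothetical feature lists: (m/z, retention time in minutes)
expression_clone = [(301.1410, 5.20), (455.2900, 8.10), (612.3301, 10.40)]
empty_vector = [(301.1412, 5.25), (612.3300, 10.35)]

candidates = unique_features(expression_clone, empty_vector)
print(candidates)
```

Features that survive this filter in all replicates of the expression clone are the ones worth taking forward to molecular formula assignment.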

Visualization: BiG-SCAPE's Position in the Genome Mining Workflow

[Diagram] Input Genomes/Metagenomes -> (Genomic Data) antiSMASH (BGC Prediction) -> (Predicted BGCs, .gbk files) BiG-SCAPE & CORASON (GCF Analysis) -> (GCF Network & Files) Downstream Analysis. Notes: BiG-SCAPE scope is BGCs to GCFs; limitation boundary at antiSMASH: no de novo prediction; limitation boundary at BiG-SCAPE: no chemical prediction.

Title: BiG-SCAPE's Role and Limits in the Analysis Pipeline

The Scientist's Toolkit: Essential Research Reagents & Materials

Table 2: Key Reagent Solutions for GCF Experimental Validation

Item | Function in GCF Research
antiSMASH-processed GenBank (.gbk) Files | The essential, non-negotiable input for BiG-SCAPE. Contains the domain-annotated BGC predictions.
BiG-SCAPE Pfam Database | The curated set of HMMs used to define protein domains; version dictates homology detection.
Heterologous Expression Host | A genetically tractable chassis (e.g., S. albus) for expressing silent or refactored BGCs from novel GCFs.
Broad-Spectrum PCR Primers | Targeting conserved backbone genes (e.g., ketosynthase, adenylation domains) for GCF-specific screening.
Chemical Elicitors (e.g., Rare Earth Salts) | Used in cultivation to potentially activate silent BGCs identified in silico but not expressed in lab conditions.
LC-MS Metabolomics Standards | Internal standards and compound libraries for dereplicating known compounds and identifying novel ones.
Cytoscape Software | For visualization and further analysis of the network file (.network) output by BiG-SCAPE.

Conclusion

BiG-SCAPE has established itself as an indispensable, community-driven tool that translates BGC predictions from genomic data into actionable insights on biosynthetic diversity. By mastering its foundational concepts, methodological workflow, and optimization strategies, and by understanding its position in the ecosystem of bioinformatics tools, researchers can systematically navigate the vast diversity of BGCs. The key takeaway is that BiG-SCAPE moves beyond single-genome analysis to reveal the global landscape of gene cluster families, which helps focus discovery efforts on clusters most likely to encode novel chemistry. Future directions involve tighter integration with metabolomic data, improved user interfaces, and machine learning enhancements for more accurate functional predictions. For biomedical research, this means a more efficient, hypothesis-driven pipeline from microbial genomes to promising clinical leads.