Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis.
---
name: scikit-bio
description: Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis.
license: BSD-3-Clause license
allowed-tools: Read Write Edit Bash
compatibility: Requires Python 3.10+ and scikit-bio 0.7+ (uv pip install scikit-bio). NumPy 2.0+ is required. Optional matplotlib/seaborn/plotly for plotting; biom-format for BIOM tables; polars/anndata for table interoperability.
metadata:
version: "1.1"
skill-author: K-Dense Inc.
---
# scikit-bio
## Overview
scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.
## When to Use This Skill
This skill should be used when the user:
- Works with biological sequences (DNA, RNA, protein)
- Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
- Performs sequence alignments or searches for motifs
- Constructs or analyzes phylogenetic trees
- Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
- Performs ordination analysis (PCoA, CCA, RDA)
- Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
- Analyzes microbiome or community ecology data
- Works with protein embeddings from language models
- Needs to manipulate biological data tables
## Core Capabilities
### 1. Sequence Manipulation
Work with biological sequences using specialized classes for DNA, RNA, and protein data.
**Key operations:**
- Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
- Sequence slicing, concatenation, and searching
- Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
- Find motifs and patterns using regex
- Calculate distances (Hamming, k-mer based)
- Handle sequence quality scores and metadata
**Common patterns:**
```python
import skbio
# Read sequences from file
seq = skbio.DNA.read('input.fasta')
# Sequence operations
rc = seq.reverse_complement()
rna = seq.transcribe()
protein = rna.translate()
# Find motifs
motif_positions = seq.find_with_regex('ATG[ACGT]{3}')
# Check for properties
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()
```
**Important notes:**
- Use `DNA`, `RNA`, `Protein` classes for grammared sequences with validation
- Use `Sequence` class for generic sequences without alphabet restrictions
- Quality scores automatically loaded from FASTQ files into positional metadata
- Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)
### 2. Sequence Alignment
Perform pairwise and multiple sequence alignments using the `pair_align` engine (introduced in scikit-bio 0.7.0), a versatile and efficient dynamic-programming aligner.
**Key capabilities:**
- Global, local, and semi-global alignment (free ends configurable) in one function
- Convenience wrappers `pair_align_nucl` (BLASTN-like) and `pair_align_prot` (BLASTP-like)
- Configurable scoring: match/mismatch tuple or named substitution matrix; linear or affine gap penalties
- `PairAlignPath` results carry CIGAR strings and convert to aligned sequences
- Multiple sequence alignment storage and manipulation with `TabularMSA`
**Common patterns:**
```python
from skbio import DNA, Protein
from skbio.alignment import pair_align_nucl, pair_align_prot, pair_align, TabularMSA
# Nucleotide alignment with BLASTN-like defaults
seq1, seq2 = DNA('ACTACCAGATTACTTACGGATCAGG'), DNA('CGAAACTACTAGATTACGGATCTTA')
aln = pair_align_nucl(seq1, seq2)
aln.score # alignment score (float)
path = aln.paths[0] # PairAlignPath (repr shows CIGAR)
aligned_seqs = path.to_aligned((seq1, seq2)) # list of gapped strings
# Build a TabularMSA from the alignment path + original sequences
msa = TabularMSA.from_path_seqs(path, (seq1, seq2))
# Customize the algorithm via pair_align (default mode='global')
aln = pair_align(seq1, seq2, mode='local') # Smith-Waterman
aln = pair_align(seq1, seq2, sub_score=(2, -3), gap_cost=(5, 2)) # affine gaps
aln = pair_align(seq1, seq2, sub_score='NUC.4.4', gap_cost=3) # substitution matrix, linear gap
# Protein alignment (BLASTP-like, BLOSUM62)
aln = pair_align_prot(Protein('HEAGAWGHEE'), Protein('PAWHEAE'))
# Read a multiple alignment from file and summarize
msa = TabularMSA.read('alignment.fasta', constructor=DNA)
consensus = msa.consensus()
```
**Important notes:**
- `pair_align` replaces the removed SSW wrapper (`local_pairwise_align_ssw`, `StripedSmithWaterman`) and the deprecated pure-Python aligners (`global_pairwise_align`, `local_pairwise_align_nucleotide`, etc.)
- The result is a `PairAlignResult` that also unpacks as `score, paths, matrices` (use `keep_matrices=True` to retain the DP matrix)
- `sub_score` accepts a `(match, mismatch)` tuple or a matrix name (e.g., `'NUC.4.4'`, `'BLOSUM62'`); `gap_cost` accepts a single number (linear) or `(open, extend)` tuple (affine)
- Parse external CIGAR strings with `PairAlignPath.from_cigar('1I8M2D5M2I')`; score an existing alignment with `align_score(...)` and build a distance matrix from an MSA with `align_dists(...)`
### 3. Phylogenetic Trees
Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.
**Key capabilities:**
- Tree construction from distance matrices (UPGMA/WPGMA, Neighbor Joining, GME, BME)
- Tree rearrangement with nearest neighbor interchange (`nni`)
- Tree manipulation (pruning, rerooting, traversal)
- Distance calculations (patristic via `cophenet`, Robinson-Foulds via `compare_rfd`)
- ASCII visualization
- Newick format I/O
**Common patterns:**
```python
from skbio import TreeNode
from skbio.tree import nj, upgma, gme, bme, rf_dists
# Read tree from file
tree = TreeNode.read('tree.nwk')
# Construct tree from distance matrix
tree = nj(distance_matrix)
# Tree operations
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = [node for node in tree.tips()]
lca = tree.lca(['taxon1', 'taxon2'])
# Calculate distances
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
cophenetic_dm = tree.cophenet() # patristic distance matrix among tips
# Compare two trees (Robinson-Foulds)
rf_distance = tree.compare_rfd(other_tree)
# Pairwise RF distances among many trees -> DistanceMatrix
rf_dm = rf_dists([tree, other_tree, third_tree])
```
**Important notes:**
- Use `nj()` for neighbor joining (classic phylogenetic method)
- Use `upgma()` for UPGMA/WPGMA (assumes molecular clock)
- GME and BME are highly scalable for large trees; refine topology with `nni()`
- `cophenet()` (formerly `tip_tip_distances`) returns the patristic distance matrix; `compare_rfd()` is the Robinson-Foulds method (`compare_wrfd`/`compare_cophenet` for weighted/cophenetic variants)
- `lca()` is the lowest common ancestor; `lowest_common_ancestor` remains as an alias
- Trees can be rooted or unrooted; some metrics require specific rooting
### 4. Diversity Analysis
Calculate alpha and beta diversity metrics for microbial ecology and community analysis.
**Key capabilities:**
- Alpha diversity: richness (`sobs`, `observed_features`, `chao1`, `ace`), Shannon, Simpson, Hill numbers (`hill`), Faith's PD (`faith_pd`), generalized PD (`phydiv`), Pielou's evenness
- Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
- Phylogenetic diversity metrics (require tree input)
- Rarefaction and subsampling
- Integration with ordination and statistical tests
**Common patterns:**
```python
from skbio.diversity import alpha_diversity, beta_diversity
# Alpha diversity (phylogenetic metrics take taxa= for tip-name mapping)
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
tree=tree, taxa=feature_ids)
# Beta diversity
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
ids=sample_ids, tree=tree, taxa=feature_ids)
# Get available metrics
from skbio.diversity import get_alpha_diversity_metrics
print(get_alpha_diversity_metrics())
```
**Important notes:**
- Counts must be integers representing abundances, not relative frequencies
- The phylogenetic-metric argument is `taxa=` (renamed from `otu_ids` in 0.6.0; the old name is a deprecated alias); `observed_otus` is now `observed_features` (or `sobs`)
- `counts_matrix` may be any table-like input (NumPy array, pandas/polars DataFrame, BIOM `Table`, or AnnData) via the dispatch system
- Phylogenetic metrics (Faith's PD, UniFrac) require tree and taxa-to-tip mapping
- Use `partial_beta_diversity()` for specific sample pairs, or `block_beta_diversity()` for large block-decomposed calculations
- Alpha diversity returns a `pandas.Series`, beta diversity returns a `DistanceMatrix`
### 5. Ordination Methods
Reduce high-dimensional biological data to visualizable lower-dimensional spaces.
**Key capabilities:**
- PCoA (Principal Coordinate Analysis) from distance matrices
- CA (Correspondence Analysis) for contingency tables
- CCA (Canonical Correspondence Analysis) with environmental constraints
- RDA (Redundancy Analysis) for linear relationships
- Biplot projection for feature interpretation
**Common patterns:**
```python
from skbio.stats.ordination import pcoa, cca
import skbio
# PCoA from distance matrix (limit dimensions for large matrices)
pcoa_results = pcoa(distance_matrix, dimensions=3)
pc1 = pcoa_results.samples['PC1']
pc2 = pcoa_results.samples['PC2']
# Built-in scatter plot colored by a metadata column
fig = pcoa_results.plot(sample_metadata, column='bodysite')
# CCA with environmental variables
cca_results = cca(species_matrix, environmental_matrix)
# Save/load ordination results
pcoa_results.write('ordination.txt')
results = skbio.OrdinationResults.read('ordination.txt')
```
**Important notes:**
- PCoA works with any distance/dissimilarity matrix; pass `dimensions` as an int (count) or a float in (0, 1] (fraction of cumulative variance to retain)
- `OrdinationResults` exposes pandas-based attributes: `samples`, `features`, `eigvals`, `proportion_explained`, `biplot_scores`, `sample_constraints`
- CCA reveals environmental drivers of community composition
- `OrdinationResults.plot()` produces a matplotlib figure; results also integrate with seaborn/plotly
### 6. Statistical Testing
Perform hypothesis tests specific to ecological and biological data.
**Key capabilities:**
- PERMANOVA: test group differences using distance matrices
- ANOSIM: alternative test for group differences
- PERMDISP: test homogeneity of group dispersions
- Mantel test: correlation between distance matrices
- Bioenv: find environmental variables correlated with distances
- Differential abundance: `ancom`, `dirmult_ttest`, and `dirmult_lme` (longitudinal mixed-effects) in `skbio.stats.composition`
**Common patterns:**
```python
from skbio.stats.distance import permanova, anosim, mantel
# Test if groups differ significantly
permanova_results = permanova(distance_matrix, grouping, permutations=999)
print(f"p-value: {permanova_results['p-value']}")
# ANOSIM test
anosim_results = anosim(distance_matrix, grouping, permutations=999)
# Mantel test between two distance matrices
mantel_results = mantel(dm1, dm2, method='pearson', permutations=999)
print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}")
# Differential abundance on a feature table (raw counts recommended)
from skbio.stats.composition import dirmult_ttest
da = dirmult_ttest(counts_table, grouping, treatment='caseA', reference='control')
```
**Important notes:**
- Permutation tests provide non-parametric significance testing
- Use 999+ permutations for robust p-values
- PERMANOVA sensitive to dispersion differences; pair with PERMDISP
- Mantel tests assess matrix correlation (e.g., geographic vs genetic distance)
- Supply differential-abundance tests with raw counts, not pre-normalized proportions, to preserve magnitude information
### 7. File I/O and Format Conversion
Read and write 19+ biological file formats with automatic format detection.
**Supported formats:**
- Sequences: FASTA, FASTQ, GenBank, EMBL, QSeq
- Alignments: Clustal, PHYLIP, Stockholm
- Trees: Newick
- Tables: BIOM (HDF5 and JSON)
- Distances: delimited square matrices
- Analysis: BLAST+6/7, GFF3, Ordination results
- Metadata: TSV/CSV with validation
**Common patterns:**
```python
import skbio
# Read with automatic format detection
seq = skbio.DNA.read('file.fasta', format='fasta')
tree = skbio.TreeNode.read('tree.nwk')
# Write to file
seq.write('output.fasta', format='fasta')
# Generator for large files (memory efficient)
for seq in skbio.io.read('large.fasta', format='fasta', constructor=skbio.DNA):
process(seq)
# Convert formats
seqs = list(skbio.io.read('input.fastq', format='fastq', constructor=skbio.DNA))
skbio.io.write(seqs, format='fasta', into='output.fasta')
```
**Important notes:**
- Use generators for large files to avoid memory issues
- Format can be auto-detected when `into` parameter specified
- Some objects can be written to multiple formats
- Support for stdin/stdout piping with `verify=False`
### 8. Distance Matrices
Create and manipulate distance/dissimilarity matrices with statistical methods.
**Key capabilities:**
- Store symmetric (`DistanceMatrix`, hollow diagonal) or general pairwise (`PairwiseMatrix`) data
- ID-based indexing and slicing
- Integration with diversity, ordination, and statistical tests
- Read/write delimited text format
**Common patterns:**
```python
from skbio import DistanceMatrix
import numpy as np
# Create from array
data = np.array([[0, 1, 2], [1, 0, 3], [2, 3, 0]])
dm = DistanceMatrix(data, ids=['A', 'B', 'C'])
# Access distances
dist_ab = dm['A', 'B']
row_a = dm['A']
# Read from file
dm = DistanceMatrix.read('distances.txt')
# Use in downstream analyses
pcoa_results = pcoa(dm)
permanova_results = permanova(dm, grouping)
```
**Important notes:**
- `DistanceMatrix` enforces symmetry and a zero (hollow) diagonal; it is a subclass of `SymmetricMatrix`
- `PairwiseMatrix` (renamed from `DissimilarityMatrix`, which is kept as a deprecated alias) allows general/asymmetric values
- IDs enable integration with metadata and biological knowledge
- Compatible with pandas, numpy, and scikit-learn
### 9. Biological Tables
Work with feature tables (OTU/ASV tables) common in microbiome research.
**Key capabilities:**
- BIOM format I/O (HDF5 and JSON) via the native `Table` class
- Table dispatch system (0.7.0+): functions accept any `table_like` input — BIOM `Table`, pandas/polars DataFrame, NumPy array, or AnnData — without explicit conversion
- Data augmentation techniques (`phylomix`, `mixup`, `aitchison_mixup`, `compos_cutmix`)
- Sample/feature filtering and normalization
- Metadata integration
**Common patterns:**
```python
from skbio import Table
from skbio.diversity import beta_diversity
# Read BIOM table
table = Table.read('table.biom')
# Access data
sample_ids = table.ids(axis='sample')
feature_ids = table.ids(axis='observation')
counts = table.matrix_data
# Filter
filtered = table.filter(sample_ids_to_keep, axis='sample')
# Pass table-like objects directly to scikit-bio drivers (dispatch system)
import pandas as pd
df = pd.read_table('data.tsv', index_col=0) # samples x features
bdiv = beta_diversity('braycurtis', df) # no manual conversion needed
```
**Important notes:**
- BIOM tables are standard in QIIME 2 workflows
- Rows typically represent samples, columns represent features (OTUs/ASVs)
- Supports sparse and dense representations
- With the dispatch system, functions return the same format as their input, or a user-specified output format
### 10. Protein Embeddings
Work with protein language model embeddings for downstream analysis.
**Key capabilities:**
- Store embeddings from protein language models (ESM, ProtTrans, etc.)
- Convert embeddings to distance matrices
- Generate ordination objects for visualization
- Export to numpy/pandas for ML workflows
**Common patterns:**
```python
from skbio.embedding import ProteinEmbedding, ProteinVector
# Create embedding from array
embedding = ProteinEmbedding(embedding_array, sequence_ids)
# Convert to distance matrix for analysis
dm = embedding.to_distances(metric='euclidean')
# PCoA visualization of embedding space
pcoa_results = embedding.to_ordination(metric='euclidean', method='pcoa')
# Export for machine learning
array = embedding.to_array()
df = embedding.to_dataframe()
```
**Important notes:**
- Embeddings bridge protein language models with traditional bioinformatics
- Compatible with scikit-bio's distance/ordination/statistics ecosystem
- SequenceEmbedding and ProteinEmbedding provide specialized functionality
- Useful for sequence clustering, classification, and visualization
## Best Practices
### Installation
```bash
uv pip install scikit-bio
```
Requires Python 3.10+ and NumPy 2.0+. Pre-compiled wheels are published for each release since 0.7.0, so most platforms install without a compiler. Conda users can instead run `conda install -c conda-forge scikit-bio`.
### Performance Considerations
- Use generators for large sequence files to minimize memory usage
- For massive phylogenetic trees, prefer GME or BME over NJ
- Beta diversity calculations can be parallelized with `partial_beta_diversity()`
- BIOM format (HDF5) more efficient than JSON for large tables
### Integration with Ecosystem
- Sequences interoperate with Biopython via standard formats
- Tables integrate with pandas, polars, and AnnData
- Distance matrices compatible with scikit-learn
- Ordination results visualizable with matplotlib/seaborn/plotly
- Works seamlessly with QIIME 2 artifacts (BIOM, trees, distance matrices)
### Common Workflows
1. **Microbiome diversity analysis**: Read BIOM table → Calculate alpha/beta diversity → Ordination (PCoA) → Statistical testing (PERMANOVA)
2. **Phylogenetic analysis**: Read sequences → Align → Build distance matrix → Construct tree → Calculate phylogenetic distances
3. **Sequence processing**: Read FASTQ → Quality filter → Trim/clean → Find motifs → Translate → Write FASTA
4. **Comparative genomics**: Read sequences → Pairwise alignment → Calculate distances → Build tree → Analyze clades
## Reference Documentation
For detailed API information, parameter specifications, and advanced usage examples, refer to `references/api_reference.md` which contains comprehensive documentation on:
- Complete method signatures and parameters for all capabilities
- Extended code examples for complex workflows
- Troubleshooting common issues
- Performance optimization tips
- Integration patterns with other libraries
## Additional Resources
- Official documentation: https://scikit.bio/docs/latest/
- GitHub repository: https://github.com/scikit-bio/scikit-bio
- Changelog: https://github.com/scikit-bio/scikit-bio/blob/main/CHANGELOG.md
- Reference paper: "scikit-bio: a fundamental Python library for biological omic data," *Nature Methods* (2025), https://www.nature.com/articles/s41592-025-02981-z
- Forum support: https://forum.qiime2.org (scikit-bio is part of QIIME 2 ecosystem)
Creator's repository · k-dense-ai/scientific-agent-skills
License: BSD-3-Clause license