Output File Formats
This page documents the output files produced by clam commands.
Zarr Format Overview
clam uses Zarr for its intermediate and output files. Zarr is a format for storing chunked, compressed N-dimensional arrays. Zarr enables better compression for storing multisample per-base values (depth in this case) than. Please see this small benchmark for a comparison.
Schema Stability
The Zarr schema (metadata structure, array layout) is not yet stable and may change in future versions of clam. If you are building tools that read clam's Zarr outputs, please reach out via Github issue with any feedback and/or your use case(s).
Depth Zarr Store
clam collect produces a Zarr store containing raw depth values for all samples.
Structure:
depths.zarr/
├── zarr.json # Zarr v3 metadata
├── chr1/ # Per-chromosome arrays
│ └── ...
├── chr2/
│ └── ...
└── ...
Root Metadata (clam_metadata):
{
"contigs": [
{"name": "chr1", "length": 248956422},
{"name": "chr2", "length": 242193529}
],
"column_names": ["sample1", "sample2", "sample3"],
"chunk_size": 1000000
}
| Field | Type | Description |
|---|---|---|
contigs |
array | List of chromosomes with names and lengths |
column_names |
array | Sample names in column order |
chunk_size |
integer | Chunk size in base pairs |
Array Properties:
| Property | Value |
|---|---|
| Data type | uint32 |
| Shape | (chromosome_length, num_samples) |
| Chunk shape | (chunk_size, num_samples) |
| Compression | Blosc (Zstd, level 5, byte shuffle) |
Per-Chromosome Metadata (contig_info):
clam loci Output
Callable Sites Zarr Store (default)
By default, clam loci produces a Zarr store containing callable sample counts per population.
Structure: Same as depth Zarr (see above).
Root Metadata (clam_metadata):
{
"contigs": [
{"name": "chr1", "length": 248956422},
{"name": "chr2", "length": 242193529}
],
"column_names": ["PopA", "PopB", "PopC"],
"chunk_size": 1000000,
"callable_loci_type": "population_counts",
"populations": [
{"name": "PopA", "samples": ["sample1", "sample2", "sample3"]},
{"name": "PopB", "samples": ["sample4", "sample5"]},
{"name": "PopC", "samples": ["sample6", "sample7", "sample8"]}
]
}
| Field | Type | Description |
|---|---|---|
contigs |
array | List of chromosomes with names and lengths |
column_names |
array | Population names in column order |
chunk_size |
integer | Chunk size in base pairs |
callable_loci_type |
string | Type of callable loci data (population_counts) |
populations |
array | Population definitions with sample membership (see below) |
Population metadata:
| Field | Type | Description |
|---|---|---|
name |
string | Population name |
samples |
array | List of sample names in this population |
Array Properties:
| Property | Value |
|---|---|
| Data type | uint16 |
| Shape | (chromosome_length, num_populations) |
| Chunk shape | (chunk_size, num_populations) |
| Compression | Blosc (Zstd, level 5, byte shuffle) |
Data interpretation:
Each value represents the number of samples in that population that are callable at that position. For example, if PopA has 10 samples and the value at position 1000 is 8, then 8 of the 10 samples in PopA are callable at position 1000.
Per-Sample Mask Zarr Store (--per-sample)
When using --per-sample, clam loci outputs a boolean mask indicating callability for each sample.
Root Metadata (clam_metadata):
{
"contigs": [
{"name": "chr1", "length": 248956422}
],
"column_names": ["sample1", "sample2", "sample3"],
"chunk_size": 1000000,
"callable_loci_type": "sample_masks",
"populations": [
{"name": "PopA", "samples": ["sample1", "sample2"]},
{"name": "PopB", "samples": ["sample3"]}
]
}
| Field | Type | Description |
|---|---|---|
contigs |
array | List of chromosomes with names and lengths |
column_names |
array | Sample names in column order |
chunk_size |
integer | Chunk size in base pairs |
callable_loci_type |
string | Type of callable loci data (sample_masks) |
populations |
array | Population definitions with sample membership |
Array Properties:
| Property | Value |
|---|---|
| Data type | bool |
| Shape | (chromosome_length, num_samples) |
| Chunk shape | (chunk_size, num_samples) |
| Compression | PackBits |
Data interpretation:
true: Site is callable for this samplefalse: Site is not callable for this sample
clam stat Output
clam stat produces tab-separated (TSV) files in the specified output directory.
pi.tsv
Nucleotide diversity (π) per population per window.
Always generated.
| Column | Type | Description |
|---|---|---|
chrom |
string | Chromosome name |
start |
integer | Window start position (1-based) |
end |
integer | Window end position (1-based, inclusive) |
population |
string | Population name |
pi |
float | Nucleotide diversity estimate |
comparisons |
integer | Total pairwise comparisons (denominator) |
differences |
integer | Pairwise differences observed (numerator) |
Example:
chrom start end population pi comparisons differences
chr1 1 10000 PopA 0.00234 4500000 10530
chr1 1 10000 PopB 0.00198 3800000 7524
chr1 10001 20000 PopA 0.00241 4600000 11086
Notes:
pi = differences / comparisonscomparisonsincludes both variant and invariant callable sitesNaNindicates no valid comparisons in the window
dxy.tsv
Absolute divergence (dxy) between population pairs.
Generated only when >1 population is defined.
| Column | Type | Description |
|---|---|---|
chrom |
string | Chromosome name |
start |
integer | Window start position (1-based) |
end |
integer | Window end position (1-based, inclusive) |
population1 |
string | First population name |
population2 |
string | Second population name |
dxy |
float | Absolute divergence estimate |
comparisons |
integer | Total between-population comparisons |
differences |
integer | Between-population differences |
Example:
chrom start end population1 population2 dxy comparisons differences
chr1 1 10000 PopA PopB 0.00312 8500000 26520
chr1 1 10000 PopA PopC 0.00287 7200000 20664
chr1 1 10000 PopB PopC 0.00301 6800000 20468
Notes:
- One row per population pair per window
- Population pairs are unordered (
PopA-PopBis the same asPopB-PopA) dxy = differences / comparisons
fst.tsv
Fixation index (FST) between population pairs.
Generated only when >1 population is defined.
| Column | Type | Description |
|---|---|---|
chrom |
string | Chromosome name |
start |
integer | Window start position (1-based) |
end |
integer | Window end position (1-based, inclusive) |
population1 |
string | First population name |
population2 |
string | Second population name |
fst |
float | FST estimate (Hudson estimator) |
Example:
chrom start end population1 population2 fst
chr1 1 10000 PopA PopB 0.156
chr1 1 10000 PopA PopC 0.089
chr1 1 10000 PopB PopC 0.201
Notes:
- FST is calculated using the Hudson estimator: FST = (dxy - πavg) / dxy
- Values range from 0 (no differentiation) to 1 (complete differentiation)
- Negative values can occur when within-population diversity exceeds between-population divergence
NaNindicates undefined FST (e.g., no polymorphic sites)
heterozygosity.tsv
Per-sample or per-population heterozygosity.
Generated only when callable sites are provided.
| Column | Type | Description |
|---|---|---|
chrom |
string | Chromosome name |
start |
integer | Window start position (1-based) |
end |
integer | Window end position (1-based, inclusive) |
sample |
string | Sample name (only in per-sample mode) |
population |
string | Population name |
het_total |
integer | Number of heterozygous sites |
callable_total |
integer | Number of callable sites |
heterozygosity |
float | Heterozygosity rate (het_total / callable_total) |
het_not_in_roh |
integer | Het sites excluding ROH (if --roh provided) |
callable_not_in_roh |
integer | Callable sites excluding ROH (if --roh provided) |
heterozygosity_not_in_roh |
float | Heterozygosity excluding ROH (if --roh provided) |
Example (per-sample mode):
chrom start end sample population het_total callable_total heterozygosity het_not_in_roh callable_not_in_roh heterozygosity_not_in_roh
chr1 1 10000 sample1 PopA 234 9500 0.0246 220 9000 0.0244
chr1 1 10000 sample2 PopA 198 9200 0.0215 198 9200 0.0215
Example (per-population mode):
chrom start end population het_total callable_total heterozygosity het_not_in_roh callable_not_in_roh heterozygosity_not_in_roh
chr1 1 10000 PopA 1856 94500 0.0196 1750 89000 0.0197
chr1 1 10000 PopB 2134 96000 0.0222 2134 96000 0.0222
Notes:
- Per-sample mode is used when callable sites were generated with
--per-sampleinclam loci - Per-population mode is used when callable sites contain population counts (default
clam locioutput)
Output Summary
| Command | Output | Format | Description |
|---|---|---|---|
collect |
*.zarr/ |
Zarr | Raw depth values (uint32) |
loci |
*.zarr/ |
Zarr | Callable counts (uint16) or masks (bool) |
stat |
pi.tsv |
TSV | Nucleotide diversity |
stat |
dxy.tsv |
TSV | Absolute divergence (if >1 pop) |
stat |
fst.tsv |
TSV | Fixation index (if >1 pop) |
stat |
heterozygosity.tsv |
TSV | Heterozygosity (if callable sites provided) |
Working with Outputs
Reading Zarr in Python
import zarr
# Open store
store = zarr.open("callable.zarr", mode="r")
# Get metadata
meta = store.attrs["clam_metadata"]
populations = meta["column_names"]
chunk_size = meta["chunk_size"]
# Read chromosome data
chr1 = store["chr1"][:] # Shape: (chrom_length, num_populations)
# Read specific region (efficient)
region = store["chr1"][1000000:2000000, :]
For more detailed examples of working with Zarr outputs, including exploring depth distributions, see the example notebooks (coming soon).