Input File Formats
This page documents the file formats accepted by clam.
Depth Files
D4 Files
D4 is a compact format for storing per-base depth information.
Formats accepted:
- Uncompressed D4 (.d4)
- Bgzipped D4 with index (.d4.gz + .d4.gz.gzi)
Generating D4 files:
# Generate D4 from BAM using mosdepth
mosdepth --d4 sample sample.bam
# Optionally bgzip and index to save disk space
bgzip --index sample.per-base.d4
Sample name extraction:
When using positional arguments, clam extracts sample names from D4 filenames. For a file named sample1.per-base.d4.gz, the sample name is sample1 (the part before the first .).
To use explicit sample names instead, use the --samples option with a samples file.
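The naming rule can be reproduced with plain shell parameter expansion. This is a sketch of the documented behaviour, not clam's own code, and the path is hypothetical:

```shell
# Hypothetical input path; only the filename matters for the sample name.
path=/data/depth/sample1.per-base.d4.gz

file=${path##*/}      # strip the directory part -> sample1.per-base.d4.gz
sample=${file%%.*}    # drop everything from the first "." onward -> sample1

echo "$sample"
```

By the same rule, a file named abc.sample1.d4.gz would yield abc, which is the situation the --samples option is meant for.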
GVCF Files
GVCF (Genomic VCF) files contain per-sample depth and genotype quality information at every position.
Requirements:
- Must be bgzipped (.g.vcf.gz)
- Must be tabix indexed (.g.vcf.gz.tbi)
Generating indexed GVCFs:
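One common way to satisfy both requirements is with htslib's bgzip and tabix. The loop below is a sketch that assumes uncompressed GVCFs named *.g.vcf in the current directory; that layout is an assumption, so adjust the glob to your own files.

```shell
# Compress and index every uncompressed GVCF in the current directory.
for gvcf in ./*.g.vcf; do
    [ -e "$gvcf" ] || continue   # the glob matched nothing
    bgzip "$gvcf"                # writes $gvcf.gz and removes the original
    tabix -p vcf "$gvcf.gz"      # writes $gvcf.gz.tbi
done
```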
Sample name extraction:
When using positional arguments, sample names are extracted from the filename. For a file named sample1.g.vcf.gz, the sample name is sample1 (the part before the first .).
To use explicit sample names instead, use the --samples option with a samples file.
VCF Files
For clam stat, input VCF files must be bgzipped and tabix indexed.
Requirements:
- Must be bgzipped (.vcf.gz)
- Must be tabix indexed (.vcf.gz.tbi)
Samples File
Specifies sample names, input file paths, and optionally population assignments. This is the recommended way to provide input to clam loci and clam collect as it allows explicit control over sample naming.
Format: Tab-separated with header. The sample_name and file_path columns are required; population is optional.
| Column | Required | Description |
|---|---|---|
| sample_name | Yes | Unique sample identifier |
| file_path | Yes | Path to depth file (D4 or GVCF) |
| population | No | Population assignment |
Example with populations:
sample_name file_path population
sample1 /path/to/sample1.d4.gz PopA
sample2 /path/to/sample2.d4.gz PopA
sample3 /path/to/sample3.d4.gz PopB
sample4 /path/to/sample4.d4.gz PopB
Example without populations:
sample_name file_path
sample1 /path/to/abc.sample1.d4.gz
sample2 /path/to/abc.sample2.d4.gz
sample3 /path/to/abc.sample3.d4.gz
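When many files share a naming pattern like the one above, the samples file can be generated rather than typed. A minimal sketch, assuming a hypothetical directory of files named abc.<sample>.d4.gz (a temporary directory with empty placeholder files stands in for real data here):

```shell
# Hypothetical layout: depth files named abc.<sample>.d4.gz in one directory.
dir=$(mktemp -d)
touch "$dir/abc.sample1.d4.gz" "$dir/abc.sample2.d4.gz" "$dir/abc.sample3.d4.gz"

samples_tsv=$dir/samples.tsv
printf 'sample_name\tfile_path\n' > "$samples_tsv"
for f in "$dir"/abc.*.d4.gz; do
    name=${f##*/}        # abc.sample1.d4.gz
    name=${name#abc.}    # sample1.d4.gz
    name=${name%%.*}     # sample1
    printf '%s\t%s\n' "$name" "$f" >> "$samples_tsv"
done

cut -f1 "$samples_tsv"
```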
Notes:
- Column order doesn't matter (detected from header)
- Sample names must be unique
- If the population column is present, all rows must have values
- If the population column is absent, all samples are assigned to a "default" population
- File paths can be absolute or relative to the current working directory
- Use this format when filenames don't match desired sample names (e.g., files named abc.sample1.d4.gz but you want sample name sample1)
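The rules in these notes can be checked before handing the file to clam. The awk-based check below is an illustrative sketch, not part of clam; check_samples is a hypothetical helper name and the file contents are made up:

```shell
# Hypothetical samples file; printf keeps the tab separators explicit.
printf 'sample_name\tfile_path\tpopulation\n'     > samples.tsv
printf 'sample1\t/path/to/sample1.d4.gz\tPopA\n' >> samples.tsv
printf 'sample2\t/path/to/sample2.d4.gz\tPopB\n' >> samples.tsv

check_samples() {
    awk -F'\t' '
        NR == 1 {
            # Locate columns by header name, so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("sample_name" in col) || !("file_path" in col)) {
                print "missing required column"; bad = 1; exit
            }
            has_pop = ("population" in col)
            next
        }
        {
            name = $(col["sample_name"])
            if (seen[name]++) { print "duplicate sample name: " name; bad = 1 }
            if (has_pop && $(col["population"]) == "") {
                print "no population for " name; bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

check_samples samples.tsv && echo "samples.tsv looks consistent"
```

The function exits non-zero when a required column is missing, a sample name repeats, or a population column is present but empty on some row.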
Usage:
clam loci -o callable.zarr -m 10 --samples samples.tsv
clam collect -o depths.zarr --samples samples.tsv
Population File
Deprecated: use --samples with a population column instead of the --population-file option.
Defines which samples belong to which population. Used by both clam loci and clam stat.
Format: Tab-separated, two columns, no header.
| Column | Description |
|---|---|
| 1 | Sample name |
| 2 | Population name |
Example:
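A hypothetical population file, reusing the sample and population names from the Samples File examples. The filename populations.tsv is an assumption, and printf is used only to make the tab separators visible:

```shell
printf 'sample1\tPopA\n'  > populations.tsv
printf 'sample2\tPopA\n' >> populations.tsv
printf 'sample3\tPopB\n' >> populations.tsv
printf 'sample4\tPopB\n' >> populations.tsv

# Sanity check: exactly two tab-separated columns per line, each sample listed once.
awk -F'\t' '
    NF != 2 || seen[$1]++ { print "problem on line " NR; bad = 1 }
    END { exit bad + 0 }
' populations.tsv
```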
Notes:
- Sample names must exactly match those in your input files
- For D4 files, sample names are derived from filenames (the part before the first .)
- For VCF/GVCF files, sample names come from the file header
- Each sample should appear exactly once
- Population names can be any string (no spaces)
Chromosome Include/Exclude Files
Specify chromosomes to include or exclude from analysis.
Format: One chromosome name per line, no header.
Example (exclude_chroms.txt):
Example (include_chroms.txt):
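Illustrative contents for both files, using hypothetical chromosome names (substitute the names that appear in your own reference):

```shell
# Hypothetical chromosome names; use the names found in your own reference.
printf '%s\n' chrM chrY > exclude_chroms.txt
printf '%s\n' chr1 chr2 chr3 > include_chroms.txt

# Each line should hold exactly one name, with no header or blank lines.
for list in exclude_chroms.txt include_chroms.txt; do
    awk -v f="$list" '
        NF != 1 { print f ": unexpected content on line " NR; bad = 1 }
        END { exit bad + 0 }
    ' "$list"
done
```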
Usage:
# Exclude specific chromosomes
clam loci --exclude-file exclude_chroms.txt ...
# Only analyze specific chromosomes
clam loci --include-file include_chroms.txt ...
Per-Chromosome Thresholds File
Specify different depth thresholds for different chromosomes. Useful for sex chromosomes or organellar genomes.
Format: Tab-separated, three columns, no header.
| Column | Description |
|---|---|
| 1 | Chromosome name |
| 2 | Minimum depth |
| 3 | Maximum depth |
Example:
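A hypothetical thresholds file. The chromosome names, the depth values, and the filename thresholds.tsv are all made up for illustration; the check that minimum does not exceed maximum is a sanity assumption, not a documented clam rule:

```shell
# Hypothetical values: lower bounds for sex chromosomes, higher ones for the mitochondrion.
printf 'chrX\t5\t50\n'       > thresholds.tsv
printf 'chrY\t5\t50\n'      >> thresholds.tsv
printf 'chrM\t100\t10000\n' >> thresholds.tsv

# Check: three tab-separated columns, and minimum <= maximum on every line.
awk -F'\t' '
    NF != 3 || $2 + 0 > $3 + 0 { print "invalid line " NR ": " $0; bad = 1 }
    END { exit bad + 0 }
' thresholds.tsv
```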
Notes:
- Chromosomes not listed in the file will use the default thresholds from the -m and -M options
- This allows setting lower thresholds for hemizygous chromosomes (X, Y in XY individuals)
- Mitochondrial/chloroplast genomes often need much higher thresholds
ROH File
Specifies runs of homozygosity (ROH) regions per sample. Used by clam stat to exclude samples within ROH regions when calculating diversity, enabling estimation of non-ROH heterozygosity (π_non-ROH).
Format: BED format (tab-separated) with sample name in the 4th column.
| Column | Description |
|---|---|
| 1 | Chromosome |
| 2 | Start position (0-based) |
| 3 | End position |
| 4 | Sample name |
Requirements:
- Must be bgzipped (.bed.gz)
- Must be tabix indexed (.bed.gz.tbi)
Example (roh.bed):
chr1 1000000 2000000 sample1
chr1 5000000 5500000 sample1
chr1 1500000 2500000 sample2
chr2 3000000 4000000 sample1
Preparing the file:
# Sort by chromosome and position
sort -k1,1 -k2,2n roh.bed > roh.sorted.bed
# Compress and index
bgzip roh.sorted.bed
tabix -p bed roh.sorted.bed.gz
Notes:
- Sample names must match those in your VCF
- ROH regions can overlap between samples
- Coordinates are 0-based, half-open (standard BED format)
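Because the intervals are half-open, a region covers exactly end minus start bases. The sketch below recreates the example roh.bed from above and totals the ROH length per sample; it assumes the regions of a given sample do not overlap each other, as is the case in the example:

```shell
# Recreate the example roh.bed shown above.
printf 'chr1\t1000000\t2000000\tsample1\n'  > roh.bed
printf 'chr1\t5000000\t5500000\tsample1\n' >> roh.bed
printf 'chr1\t1500000\t2500000\tsample2\n' >> roh.bed
printf 'chr2\t3000000\t4000000\tsample1\n' >> roh.bed

# With 0-based, half-open intervals, a region covers end - start bases.
roh_totals=$(awk -F'\t' '
    { total[$4] += $3 - $2 }
    END { for (s in total) print s "\t" total[s] }
' roh.bed | sort)
printf '%s\n' "$roh_totals"
```

For the example data this reports 2500000 bases for sample1 and 1000000 for sample2.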
Regions File
Specifies custom regions for calculating statistics in clam stat. Use this instead of --window-size for non-uniform windows or specific genomic features.
Format: Standard BED format (tab-separated).
| Column | Description |
|---|---|
| 1 | Chromosome |
| 2 | Start position (0-based) |
| 3 | End position |
Example (genes.bed):
Notes:
- Coordinates are 0-based, half-open (standard BED format)
- Regions can be of variable size
- Regions should not overlap (statistics are calculated independently per region)
- Additional columns (name, score, etc.) are ignored
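The non-overlap recommendation can be checked with sort and awk before running clam stat. The regions below are hypothetical, and the third one deliberately overlaps the second so that the check has something to report:

```shell
# Hypothetical regions; the third one overlaps the second.
printf 'chr1\t1000\t5000\n'    > genes.bed
printf 'chr1\t8000\t12000\n'  >> genes.bed
printf 'chr1\t11000\t15000\n' >> genes.bed
printf 'chr2\t2000\t6000\n'   >> genes.bed

# After sorting, a region overlaps an earlier one if it starts before the
# furthest end seen so far on the same chromosome. Because intervals are
# half-open, start == previous end is not an overlap.
overlaps=$(sort -k1,1 -k2,2n genes.bed | awk -F'\t' '
    $1 != chrom      { chrom = $1; max_end = 0 }
    $2 + 0 < max_end { print $1 ":" $2 "-" $3 }
    $3 + 0 > max_end { max_end = $3 + 0 }
')
printf '%s\n' "${overlaps:-no overlaps found}"
```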
Summary Table
| File Type | Extension | Index Required | Used By |
|---|---|---|---|
| D4 depth | .d4 or .d4.gz | .d4.gz.gzi (if compressed) | loci, collect |
| GVCF | .g.vcf.gz | .g.vcf.gz.tbi | loci, collect |
| VCF | .vcf.gz | .vcf.gz.tbi | stat |
| Samples | .tsv | No | loci, collect |
| Population | .tsv | No | loci, stat (deprecated for loci) |
| Chromosome list | .txt | No | loci, stat, collect |
| Thresholds | .tsv | No | loci |
| ROH | .bed.gz | .bed.gz.tbi | stat |
| Regions | .bed | No | stat |