Generating Callable Loci
The first step is to generate callable loci files, which identify genomic regions where sequencing depth is sufficient for reliable variant calling.
Inputs
clam loci accepts sequencing depth information in two formats:
- D4 files (default): Highly compressed records of sample depth at every genomic site. You can generate D4 files from alignment (BAM) files using tools like mosdepth. D4 files must be bgzipped and indexed.
- GVCF files: Use the --gvcf flag to specify input files in GVCF format. (Experimental as of v0.1.2)
Generating D4 Files using mosdepth
Generating per-base depth with mosdepth, then bgzipping and indexing the result, yields sample.per-base.d4.gz and its index, sample.per-base.d4.gz.gzi.
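One way to produce these two files is sketched below. It assumes a mosdepth build with D4 support (the --d4 flag), an indexed BAM with the hypothetical name sample.bam, and htslib's bgzip on the PATH. The commands are skipped if the tools or the BAM are missing.

```shell
# Hypothetical sample name and BAM path; adjust to your data.
sample="sample"
bam="${sample}.bam"
d4="${sample}.per-base.d4"    # name mosdepth gives its per-base D4 output

if command -v mosdepth >/dev/null 2>&1 \
   && command -v bgzip >/dev/null 2>&1 \
   && [ -f "$bam" ]; then
    # Write per-base depth in D4 format, using "$sample" as the output prefix.
    mosdepth --d4 "$sample" "$bam"
    # Compress to ${d4}.gz and write the ${d4}.gz.gzi index next to it.
    bgzip --index "$d4"
fi
```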
You can provide input files in several ways:
- As positional arguments:
  clam loci file1.d4.gz file2.d4.gz -o output_dir
- Using a file list, with one file path per line:
  clam loci -f file_list.txt -o output_dir
- Using a merged D4 file:
  clam loci --merged merged_samples.d4 -o output_dir
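For the file-list option, the list is plain text with one path per line. A minimal sketch of building one, using hypothetical file names:

```shell
# Hypothetical D4 paths; replace with your own bgzipped, indexed files.
printf '%s\n' sampleA.d4.gz sampleB.d4.gz sampleC.d4.gz > file_list.txt

# Show the result: one file path per line.
cat file_list.txt

# The list can then be passed with: clam loci -f file_list.txt -o output_dir
```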
Options
Depth Thresholds
Control which sites are considered "callable" using the following options:
Per-Sample Thresholds
The following options control whether a site is considered callable at the sample level.
- -m, --min-depth: Minimum depth to consider a site callable for each individual [default: 0]
- -M, --max-depth: Maximum depth to consider a site callable for each individual [default: inf]
- --thresholds-file: Custom thresholds per chromosome. Tab-separated file with columns chrom, min, max.
Example format:
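A thresholds file might look like the one written by the sketch below. The chromosome names and depth limits are hypothetical, and the absence of a header line is an assumption; only the column order (chrom, min, max) comes from the option description.

```shell
# Hypothetical chromosomes and limits; columns are chrom, min, max (tab-separated).
printf 'chr1\t10\t100\n' >  thresholds.tsv
printf 'chr2\t10\t100\n' >> thresholds.tsv
printf 'chrX\t5\t50\n'   >> thresholds.tsv

# Sanity check: every line has exactly three fields and min does not exceed max.
awk -F'\t' 'NF != 3 || $2 > $3 { bad = 1 } END { exit bad + 0 }' thresholds.tsv \
    && echo "thresholds.tsv looks well-formed"
```

The file would then be supplied through --thresholds-file.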
Population-Level Thresholds
The following options control whether a site is considered callable at the population level.
- -d, --depth-proportion: Proportion of samples that must pass thresholds (0.0-1.0) [default: 0]
- -u, --min-mean-depth: Minimum mean depth across all samples [default: 0]
- -U, --max-mean-depth: Maximum mean depth across all samples [default: inf]
Example requiring at least 80% of samples to pass individual thresholds:
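A hedged sketch of such a run follows. The minimum depth of 10 and the file names are illustrative assumptions. The toy table afterwards only illustrates how the per-sample and proportion thresholds are assumed to combine (with inclusive comparisons); it is not clam's implementation.

```shell
# Hypothetical settings; adjust to your data.
min_depth=10      # per-sample minimum depth (-m)
proportion=0.8    # fraction of samples that must pass (-d)

# Run clam only if it is installed.
if command -v clam >/dev/null 2>&1; then
    clam loci -m "$min_depth" -d "$proportion" -f file_list.txt -o output_dir
fi

# Toy depth table (made up): one row per site, one column per sample.
printf '12\t15\t9\t20\t11\n'   >  toy_depths.tsv
printf '3\t14\t2\t1\t10\n'     >> toy_depths.tsv
printf '30\t25\t18\t22\t40\n'  >> toy_depths.tsv

# Assumed rule: a sample passes when depth >= min, and a site counts as
# callable when the passing fraction of samples is >= prop.
callable_sites=$(awk -F'\t' -v min="$min_depth" -v prop="$proportion" '
    { pass = 0
      for (i = 1; i <= NF; i++) if ($i >= min) pass++
      if (pass / NF >= prop) n++ }
    END { print n + 0 }' toy_depths.tsv)
echo "callable sites: $callable_sites"
```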
Chromosome Filtering
You can exclude specific chromosomes or restrict the analysis to a chosen set; see the CLI Reference for details.
Specifying Populations
clam loci supports multiple populations for estimating dxy and FST downstream with clam stat.
To specify populations (-p, --populations), create a tab-separated file that maps samples to population labels:
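A population file could be written as in the sketch below. The sample and population names are hypothetical, and the column order (sample first, population second) is an assumption based on the description above.

```shell
# Hypothetical samples and labels; columns are sample, population (tab-separated).
printf 'sampleA\tpop1\n' >  populations.tsv
printf 'sampleB\tpop1\n' >> populations.tsv
printf 'sampleC\tpop2\n' >> populations.tsv

# Quick check: number of samples assigned to each population.
cut -f2 populations.tsv | sort | uniq -c

# The file can then be passed with: clam loci -p populations.tsv ... -o output_dir
```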
Sample Names
The sample names in your population file must exactly match the sample identifiers contained in your input files. For D4 files, this is typically the prefix of the filename (before .d4.gz).
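To see which names the population file is expected to use, the filename prefixes can be listed as in this sketch. The file names are hypothetical, and whether a given D4 file actually carries this prefix as its sample identifier should be checked against your own data.

```shell
# Hypothetical D4 file names; strip the .d4.gz suffix to get the assumed sample names.
for f in sampleA.d4.gz sampleB.d4.gz; do
    basename "$f" .d4.gz
done > expected_samples.txt

cat expected_samples.txt
```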
Outputs
By default, clam loci generates a callable loci interval file in the specified output directory. This file contains genomic regions that meet your specified callability criteria.
Next Steps
The callable loci file generated by this command can be used with the clam stat command to calculate population genetic statistics while accounting for regions where genotypes could be reliably called.