Generating Callable Loci

The first step is to generate callable loci files, which identify genomic regions where sequencing depth is sufficient for reliable variant calling.

Inputs

clam loci accepts sequencing depth information in two formats:

  1. D4 files (default): These are highly compressed records of per-sample depth at every genomic site. You can generate D4 files from alignment (BAM) files using tools like mosdepth. The D4 files must be bgzipped and indexed.

  2. GVCF files: Use the --gvcf flag to specify input files in GVCF format. (Experimental as of v0.1.2)

Generating D4 Files using mosdepth

The following commands generate sample.per-base.d4.gz and its index, sample.per-base.d4.gz.gzi:

mosdepth --d4 sample sample.bam
bgzip --index sample.per-base.d4
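
With many samples, the two commands above can be wrapped in a loop. The following is a minimal sketch, assuming mosdepth and bgzip are on your PATH and that each BAM file is named after its sample. The helper names run and bams_to_d4 and the DRY_RUN variable are hypothetical conveniences, not part of clam or mosdepth.

```shell
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper for previewing what would be run.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# bams_to_d4 BAM...: for each BAM, write <sample>.per-base.d4 with mosdepth,
# then compress and index it with bgzip. The sample name is assumed to be
# the BAM file name without its .bam extension.
bams_to_d4() {
    for bam in "$@"; do
        sample=$(basename "$bam" .bam)
        run mosdepth --d4 "$sample" "$bam" || return 1
        run bgzip --index "${sample}.per-base.d4" || return 1
    done
}

# Example usage (preview the commands first, then run them for real):
#   DRY_RUN=1 bams_to_d4 bams/*.bam
#   bams_to_d4 bams/*.bam
```

Previewing with DRY_RUN=1 makes it easy to confirm the derived sample names before any files are written.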

You can provide input files to clam loci in several ways:

  • As positional arguments: clam loci file1.d4.gz file2.d4.gz -o output_dir
  • Using a file list: clam loci -f file_list.txt -o output_dir, where file_list.txt contains one file path per line.
  • Using a merged D4 file: clam loci --merged merged_samples.d4 -o output_dir
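
For the file-list option, the list can be built from the filesystem rather than typed by hand. The sketch below is one way to do that; the function name make_file_list and the d4 directory are illustrative assumptions, not part of clam.

```shell
# make_file_list DIR OUT: write every *.d4.gz path under DIR to OUT,
# one path per line, sorted so the sample order is reproducible.
# Index files (*.d4.gz.gzi) do not match the pattern and are skipped.
# (Hypothetical helper; not part of clam.)
make_file_list() {
    find "$1" -type f -name '*.d4.gz' | sort > "$2"
}

# Example usage, assuming the D4 files live under ./d4:
#   make_file_list d4 file_list.txt
#   clam loci -f file_list.txt -o output_dir
```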

Options

Depth Thresholds

Control which sites are considered "callable" using the following options:

Per-Sample Thresholds

The following options control whether a site is considered callable at the sample level.
-m, --min-depth
Minimum depth to consider a site callable for each individual [default: 0]
-M, --max-depth
Maximum depth to consider a site callable for each individual [default: inf]
--thresholds-file
Custom per-chromosome thresholds, given as a tab-separated file with columns chrom, min, max

Example format:

chr1    10    100
chr2    5     50
chrX    15    150
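
Because the thresholds file must be tab-separated, a quick sanity check before running clam loci can catch space-separated rows or swapped columns. This is a minimal sketch: check_thresholds is a hypothetical helper, and it assumes every row holds plain numeric min and max values with no header line.

```shell
# check_thresholds FILE: report rows that do not have exactly three
# tab-separated fields, or whose min is greater than max.
# Returns non-zero if any problem was found.
# (Hypothetical helper; assumes numeric values and no header row.)
check_thresholds() {
    awk -F'\t' '
        NF != 3 {
            printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF
            bad = 1
            next
        }
        ($2 + 0) > ($3 + 0) {
            printf "line %d: min (%s) is greater than max (%s)\n", NR, $2, $3
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example usage (thresholds.tsv is a placeholder file name):
#   check_thresholds thresholds.tsv && echo "thresholds file looks consistent"
```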

Population-Level Thresholds

The following options control whether a site is considered callable at the population level.
-d, --depth-proportion
Proportion of samples (0.0-1.0) that must pass the per-sample thresholds [default: 0]
-u, --min-mean-depth
Minimum mean depth across all samples [default: 0]
-U, --max-mean-depth
Maximum mean depth across all samples [default: inf]

Example requiring at least 80% of samples to pass individual thresholds:

clam loci -m 10 -M 100 -d 0.8 -o output sample1.d4.gz ...
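
To make the interaction between the per-sample thresholds and -d concrete, the sketch below applies the same rule to the depths observed at a single site. It is an illustration only: site_callable is a hypothetical helper, and it assumes the depth bounds and the proportion are compared inclusively, which may differ from clam's exact behavior.

```shell
# site_callable MIN MAX PROP DEPTH...
# Succeeds (exit 0) when the fraction of samples whose depth lies within
# [MIN, MAX] is at least PROP. Inclusive comparisons are an assumption.
# The body runs in a subshell so its variables do not leak into the caller.
site_callable() (
    [ "$#" -ge 4 ] || exit 2   # need thresholds plus at least one depth
    min=$1; max=$2; prop=$3
    shift 3
    printf '%s\n' "$@" | awk -v min="$min" -v max="$max" -v prop="$prop" '
        { n++; if ($1 + 0 >= min + 0 && $1 + 0 <= max + 0) pass++ }
        END { exit !(n > 0 && pass / n >= prop + 0) }
    '
)

# Example usage: five samples, all within 10-100, so the site passes.
#   site_callable 10 100 0.8 12 30 45 50 60 && echo callable
# Only three of five samples pass here, so the site is rejected.
#   site_callable 10 100 0.8 12 30 5 8 60 || echo "not callable"
```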

Chromosome Filtering

You can restrict your analysis to specific chromosomes or exclude specific chromosomes from it; see the CLI Reference for details.

Specifying Populations

clam loci supports multiple populations for estimating dxy and FST downstream with clam stat.

To specify populations (-p, --populations), create a tab-separated file that maps samples to population labels:

sample1    population1
sample2    population1
sample3    population2
sample4    population2
sample5    population3
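
Before passing a populations file to clam loci, it can help to confirm that it is well formed and to see how many samples each population contains. A minimal sketch, where summarize_populations is a hypothetical helper and the check for duplicate sample names is an assumption about what you would want to catch, not a documented clam requirement:

```shell
# summarize_populations FILE: print "population<TAB>count" for each label.
# Rows without exactly two tab-separated fields and repeated sample names
# are reported on stderr, and the function then returns non-zero.
# (Hypothetical helper; not part of clam.)
summarize_populations() {
    awk -F'\t' '
        NF != 2 {
            printf "line %d: expected sample<TAB>population\n", NR > "/dev/stderr"
            bad = 1
            next
        }
        seen[$1]++ {
            printf "line %d: duplicate sample %s\n", NR, $1 > "/dev/stderr"
            bad = 1
        }
        { count[$2]++ }
        END {
            for (p in count) printf "%s\t%d\n", p, count[p]
            exit bad + 0
        }
    ' "$1"
}

# Example usage (populations.tsv is a placeholder file name):
#   summarize_populations populations.tsv | sort
```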

Sample Names

The sample names in your population file must exactly match the sample identifiers contained in your input files. For D4 files, this is typically the prefix of the filename (before .d4.gz).
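
A mismatch between the two sets of names is easy to check for ahead of time. The sketch below lists samples from the populations file that have no corresponding entry in a file list. It assumes the sample identifier is the file name with its directory and .d4.gz suffix removed, as described above for the typical case; missing_samples is a hypothetical helper.

```shell
# missing_samples POPS_FILE FILE_LIST
# Print each sample named in POPS_FILE (column 1) that has no matching
# entry in FILE_LIST, where a file's sample name is assumed to be its
# base name with the .d4.gz suffix removed.
# FILE_LIST is expected to be non-empty.
missing_samples() {
    awk -F'\t' '
        NR == FNR {                      # first input: the file list
            n = split($0, parts, "/")
            name = parts[n]
            sub(/\.d4\.gz$/, "", name)
            have[name] = 1
            next
        }
        !($1 in have) { print $1 }       # second input: populations file
    ' "$2" "$1"
}

# Example usage (both file names are placeholders):
#   missing_samples populations.tsv file_list.txt
```

Empty output means every sample in the populations file was matched to an input file under this naming assumption.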

Outputs

By default, clam loci generates a callable loci interval file in the specified output directory. This file contains genomic regions that meet your specified callability criteria.

Next Steps

The callable loci file generated by this command can be used with the clam stat command to calculate population genetic statistics while accounting for regions where genotypes could be reliably called.