Short method summary

In recent years, technological advances have rapidly improved the identification of structural variants (SVs) in the human genome; however, the interpretation of these variants remains challenging. Several methods have been developed that utilize individual mechanistic principles, such as the deletion of coding sequence or the disruption of 3D genome architecture. However, a comprehensive and easy-to-use tool that utilizes the broad spectrum of available annotations to estimate the effect of SVs and to prioritize functional variants in health and disease has been missing. We therefore developed CADD-SV, a method to retrieve and integrate a wide set of annotations of SVs for functional scoring. So far, supervised learning approaches have had very limited power for this kind of application, because only very few functionally annotated (e.g. pathogenic/benign) SV sets are available. We overcome this problem by using a surrogate training objective, the Combined Annotation Dependent Depletion (CADD) of functional variants in evolutionarily derived variant sets. Our tool computes annotation summary statistics across the range and in the vicinity of SVs. We apply random forest models to differentiate deleterious from neutral structural variants. Specifically, we use human and chimpanzee derived alleles as proxy-neutral and contrast them with matched simulated variants as proxy-pathogenic, an approach that has proven powerful in the interpretation of SNVs and short InDels for CADD.

Versions:

  • v1.0 is the initial release of CADD-SV
  • v1.1 introduces a final PHRED-scaled model score to ease interpretation

Notes on using scaled vs. unscaled scores

We believe that CADD scores are useful in two distinct forms, namely "raw" and "scaled", and we provide both in our output files. "Raw" CADD scores come straight from the model, and are interpretable as the extent to which the annotation profile for a given variant suggests that the variant is likely to be "observed" (negative values) vs "simulated" (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or "not observed") and therefore more likely to have deleterious effects.

Since the raw scores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a "normalized" and now externally comparable unit of analysis. In our case, we scored and ranked all gnomAD SVs in release v2.1 and express each variant's score as its rank relative to these population variants.

For easier interpretation, starting with v1.1 we provide a PHRED-scaled transformation of the model score relative to a healthy population cohort, i.e. -10 * log10 of the proportion of variants in the gnomAD-SV set with a greater or equal score. The CADD-SV scores on the PHRED scale range from 0 (potentially benign) to 48 (potentially pathogenic), indicating the position of the novel variant within the gnomAD-SV score distribution. For example, a score of 3 or above corresponds to the top 50%, 10 to the top 10%, 20 to the top 1% and 30 to the top 0.1% of scores observed in gnomAD-SV.
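The transformation described above can be illustrated with a minimal Python sketch. It is not the CADD-SV implementation: the function name, the in-memory list of reference scores and the clamping of out-of-range scores are assumptions made for illustration only.

```python
import bisect
import math


def phred_scale(raw_score, reference_scores_sorted):
    """Return -10 * log10 of the fraction of reference scores >= raw_score.

    reference_scores_sorted must be sorted in ascending order; it stands in
    for the raw scores of the population reference set (hypothetical input).
    """
    n = len(reference_scores_sorted)
    # Count reference scores that are greater than or equal to raw_score.
    n_greater_equal = n - bisect.bisect_left(reference_scores_sorted, raw_score)
    # Assumption: a score above every reference score is treated as if one
    # reference variant matched it, so the result stays finite.
    n_greater_equal = max(n_greater_equal, 1)
    return -10.0 * math.log10(n_greater_equal / n)
```

With a reference set of 1,000 scores, a variant that is matched or exceeded by 100 of them lands at 10 on this scale, and one matched or exceeded by 10 of them lands at 20, in line with the percentiles given above.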

The advantages and disadvantages of the score sets are summarized as follows:

1. Resolution: Raw scores offer superior resolution across the entire spectrum, and preserve relative differences between scores that may otherwise be rounded away in the scaled scores. Because the population variants may lack the very high scores observed for pathogenic variants, the resolution of the scaled scores is limited at this extreme end. As a result, several variants with substantive raw score differences between them will necessarily be forced to the same or a very similar rank unit.

2. Frame of reference: Since there must always be a top-ranked variant, second-ranked variant, etc., scaled scores are easier to interpret at first glance and will be comparable across CADD-SV versions as we, for example, update the model to include new annotations (or even use an entirely distinct model-building method).

We envision the "typical use" cases for CADD-SV, and appropriate choice of score set, as follows:

1. Discovering causal variants within an individual, or small groups, of exomes or genomes. Scaled CADD-SV scores are most useful in this context, as one will generally only be interested in, or capable of, reviewing a small set of the "most interesting" variants. Further, the absolute frame of the reference variant set is valuable here, allowing an analyst to quickly place a variant in context and facilitating easier translation of results across publications, studies, etc.

2. Fine-mapping to discover causal variants within associated loci. As above, scaled scores are likely to be more useful here by allowing focus on a small set of manually reviewable best candidates and providing an absolute frame of reference.

3. Comparing distributions of scores between groups of variants, e.g., cases vs controls. In this case, raw scores should be used, as they preserve distinctions that may be relevant across the entire scoring spectrum. Scaled scores may obscure systematic and potentially highly significant distinctions between two groups of variants. Further, since such analyses are generally conducted computationally and without manual intervention, the absolute frame of reference advantage to scaled scores is not as valuable here.
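For the third use case, one simple way to summarize a shift between two groups of raw scores is the fraction of case/control pairs in which the case variant scores higher. The Python sketch below is only an illustration; the function name and the choice of this particular statistic are assumptions, not part of CADD-SV.

```python
def probability_of_superiority(case_scores, control_scores):
    """Fraction of (case, control) pairs where the case raw score is higher.

    Ties count as one half. The value equals the Mann-Whitney U statistic
    divided by the number of pairs.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

A value near 0.5 suggests no systematic difference between the groups, while values approaching 1 suggest that case variants tend to receive higher raw scores. For large variant sets a rank-based implementation from a statistics library would be faster than this pairwise loop.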

What score cutoff should I use?

There is no natural choice here -- any fixed threshold is arbitrary. We therefore recommend integrating our scores with other evidence and ranking your candidates for follow-up rather than hard filtering.

I fail to retrieve scores for my variants using the webserver. What is going wrong?

(1) If your upload fails it can be for two different reasons:

(a) You are attempting to upload a file larger than 2MB, which is automatically rejected by the webserver with a connection reset (white page, server error). In this case, please submit your variant set in smaller pieces or try removing additional columns in the BED (CADD-SV only requires the first 5 columns) to meet the upload limit. Also consider gzip-compression of your BED file. We generally recommend submitting variants in small batches, as different submissions can be processed in parallel.

(b) If the file is smaller than 2MB, but it is not correctly formatted as a BED file or the file extension is not one of bed, tsv, txt or gz, you get the "Your upload failed." error message on the regular CADD-SV website, together with a description of how the uploaded file needs to be formatted and named. If you get this type of error, please adjust the formatting of the information (i.e. 5 columns: CHROM, START, END, TYPE, NAME; the NAME column can be empty or missing) and make sure your file has one of the filename extensions mentioned above. The upload will also fail if the file uses the older Mac newline character ('\r'). Files with UNIX ('\n') or Windows ('\r\n') newlines work.
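The preparation steps suggested in (a) and (b) can be sketched in Python as follows. This is an illustration under stated assumptions rather than an official tool: all function names and the output file naming are hypothetical, the 2MB limit is assumed to mean 2 * 1024 * 1024 bytes, comment lines starting with '#' are dropped, and batch sizes are measured before compression, which is conservative.

```python
import gzip

MAX_BYTES = 2 * 1024 * 1024  # assumption: 2MB upload limit expressed in bytes


def clean_bed_lines(text):
    """Normalize newlines and keep only the first five BED columns."""
    # Convert Windows ('\r\n') and old Mac ('\r') newlines to UNIX ('\n').
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    for number, line in enumerate(text.split("\n"), 1):
        # Assumption: blank lines and '#' comment lines are not needed.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # CHROM, START, END and TYPE are required; NAME may be missing.
        if len(fields) < 4:
            raise ValueError(f"line {number}: expected at least 4 tab-separated columns")
        if not (fields[1].isdigit() and fields[2].isdigit()):
            raise ValueError(f"line {number}: START and END must be integers")
        cleaned.append("\t".join(fields[:5]))
    return cleaned


def split_batches(lines, max_bytes=MAX_BYTES):
    """Group lines into batches whose uncompressed size stays within max_bytes."""
    batches, current, size = [], [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_bytes > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        batches.append(current)
    return batches


def write_batches(batches, prefix):
    """Write each batch as a gzip-compressed BED file with UNIX newlines."""
    paths = []
    for index, batch in enumerate(batches, 1):
        path = f"{prefix}.part{index}.bed.gz"  # hypothetical naming scheme
        with gzip.open(path, "wt", newline="\n") as handle:
            handle.write("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

A typical call would read the original file as text, pass it through clean_bed_lines and split_batches, and then upload each file written by write_batches as a separate submission.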

I would like to use CADD-SV to annotate more variants than the web interface allows (>10,000), or to annotate variants on a regular basis. What should I do?

The webinterface has an arbitrarily introduced 2MB limit for computational reasons as well as to make it unattractive to use the webserver for scoring large sets of variants. We provide pre-scored files. You can use these to initiate a local dataset and then either score additional variants on our server or use our offline scoring scripts.
For scoring your variants locally, we provide the required annotation tracks and a set of scripts on the download page. Please check the installation instructions detailed in the README provided with the scripts before downloading the much larger annotation files.

The files that we provide are block-gzip compressed and tabix indexed and allow for fast retrieval of specific genomic locations using the SAMtools/HTS library. There are bindings for several programming languages, which allow easy scripting for retrieving specific variants. Please note that recent versions of tabix work even without downloading the entire data file. You can therefore do a tabix call on the file on our server and (after automatically downloading the index) retrieve multiple variants in quick succession. Please see also our API page for further information.

How to cite CADD-SV?

CADD-SV has been published as a research article in Genome Research. Please cite the following paper:

Kleinert P, Kircher M.
A framework to score the effects of structural variants in health and disease
Genome Res. 2022 Apr;32(4):766-777. doi: 10.1101/gr.275995.121. Epub 2022 Feb 23.
PubMed PMID: 35197310.

If you want to reference the concept behind CADD, please cite:

Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J.
A general framework for estimating the relative pathogenicity of human genetic variants.
Nat Genet. 2014 Feb 2. doi: 10.1038/ng.2892.
PubMed PMID: 24487276.