1. Overview
DeepDNA is an AI-powered genomic analysis platform that interprets existing DNA genotype files. We do not perform laboratory sequencing. Instead, we accept raw genotype data files exported from consumer DNA testing providers (23andMe, AncestryDNA, MyHeritage, and others) and apply a multi-layered analysis pipeline to generate a comprehensive health, pharmacogenomics, nutrigenomics, and ancestry report.
Our methodology is built on three principles:
- Clinical-grade data sources — We reference the same databases used in clinical genetics laboratories and academic research
- Transparent evidence grading — Every finding is classified by evidence strength so users and their healthcare providers can assess its relevance
- Reproducibility — Given the same input file, our pipeline produces the same report. Analysis is deterministic at the variant-interpretation layer
2. Input data and file processing
Supported file formats
DeepDNA accepts raw genotype data files in the following formats:
- 23andMe — Tab-separated text files containing rsID, chromosome, position, and genotype
- AncestryDNA — Tab-separated text files with similar variant annotations
- MyHeritage — CSV-formatted genotype exports
- VCF (Variant Call Format) — Standard bioinformatics format used by clinical laboratories and whole-genome sequencing providers
- Other providers — Any file following standard genotype export conventions (rsID + genotype mapping)
File validation
Before analysis begins, each uploaded file undergoes automated validation:
Format detection
The system identifies the file format and source provider automatically. Headers, delimiters, and encoding are verified against known schemas for each provider.
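As an illustration of this step, a format detector can be sketched from a few structural cues. This is a minimal sketch, not DeepDNA's production logic: the function name `detect_format`, the returned labels, and the assumed column layouts (four or five tab-separated columns, `#` comment lines) are assumptions made for the example.

```python
def detect_format(text: str) -> str:
    """Guess the export format from the first non-empty lines (heuristic sketch)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    # VCF files declare themselves on the very first line
    if lines[0].startswith("##fileformat=VCF"):
        return "vcf"
    # Skip '#' comment lines; what remains is a header row or the first data row
    body = [ln for ln in lines if not ln.startswith("#")]
    if not body:
        return "unknown"
    first = body[0]
    if "\t" in first:
        n_cols = len(first.split("\t"))
        # Assumed layouts: rsid, chromosome, position, genotype (4 columns)
        # or rsid, chromosome, position, allele1, allele2 (5 columns)
        if n_cols == 4:
            return "tsv_4col"
        if n_cols == 5:
            return "tsv_5col"
        return "unknown"
    if "," in first:
        return "csv"
    return "unknown"
```

A real detector would also verify header names and text encoding against each provider's schema, as described above.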
Integrity checks
The file is checked for completeness, duplicate entries, and data consistency. Minimum variant count thresholds ensure the file contains sufficient data for meaningful analysis.
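The duplicate and minimum-count checks can be sketched as follows. The threshold value and the names `MIN_VARIANTS` and `integrity_report` are hypothetical; the actual thresholds are not stated in this document.

```python
from collections import Counter

MIN_VARIANTS = 100_000  # hypothetical threshold, chosen only for illustration


def integrity_report(rsids: list[str], min_variants: int = MIN_VARIANTS) -> dict:
    """Summarise duplicates and overall size for a list of variant identifiers."""
    counts = Counter(rsids)
    duplicates = sorted(rsid for rsid, n in counts.items() if n > 1)
    return {
        "n_records": len(rsids),
        "n_unique": len(counts),
        "duplicates": duplicates,
        # True when the file has enough distinct variants to analyse
        "enough_variants": len(counts) >= min_variants,
    }
```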
Genome build alignment
Variant positions are mapped to the GRCh37/hg19 reference genome assembly using coordinate liftover where necessary. This ensures consistent annotation regardless of the original provider's build version.
Quality scoring
A file quality score is computed based on call rate, variant coverage across key genomic regions, and concordance with expected population allele frequencies. Files below quality thresholds are flagged for the user.
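Call rate, one input to the quality score, is the fraction of variants with a usable genotype. The sketch below assumes a particular set of no-call encodings and a 95% threshold; both are illustrative assumptions, and the coverage and allele-frequency concordance components are not shown.

```python
NO_CALLS = {"--", "00", "NN", ""}  # assumed no-call encodings; providers differ


def call_rate(genotypes: list[str]) -> float:
    """Fraction of genotype strings that are not no-calls."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g.strip().upper() not in NO_CALLS)
    return called / len(genotypes)


def flag_low_quality(genotypes: list[str], threshold: float = 0.95) -> bool:
    """Return True when the call rate falls below the (hypothetical) threshold."""
    return call_rate(genotypes) < threshold
```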
3. Variant interpretation pipeline
The core of DeepDNA's analysis is a multi-layered variant interpretation pipeline that processes each genetic variant through several stages of annotation, classification, and synthesis.
Stage 1: Variant annotation
Each variant (SNP) in the genotype file is annotated with:
- Gene mapping — The gene (or intergenic region) where the variant is located
- Functional consequence — Whether the variant is missense, synonymous, intronic, regulatory, or intergenic
- Population frequency — Allele frequency data from gnomAD (Genome Aggregation Database), stratified by ancestry
- Conservation scores — Evolutionary conservation metrics indicating functional importance
Stage 2: Clinical significance lookup
Annotated variants are cross-referenced against multiple curated databases to determine clinical relevance:
| Database | Maintained by | Content | Update frequency |
|---|---|---|---|
| ClinVar | NCBI / NIH (United States) | Variant-disease associations submitted by clinical laboratories, research groups, and expert panels worldwide. Over 2.5 million submissions covering pathogenic, benign, and uncertain significance classifications. | Weekly |
| PharmGKB | Stanford University | Pharmacogenomic variant annotations: how genetic variants affect drug metabolism, efficacy, and adverse reactions. Includes CPIC (Clinical Pharmacogenetics Implementation Consortium) guidelines. | Continuous |
| gnomAD | Broad Institute / MIT | Population allele frequencies from 807,162 individuals across multiple ancestries. Essential for distinguishing rare pathogenic variants from common benign polymorphisms. | Major releases |
| OMIM | Johns Hopkins University | Comprehensive compendium of human genes and genetic phenotypes. Mendelian disorder associations, gene-phenotype relationships. | Daily |
| GWAS Catalog | NHGRI-EBI | Curated collection of published genome-wide association studies. Variant-trait associations with effect sizes and p-values from studies meeting strict quality thresholds. | Weekly |
| dbSNP | NCBI / NIH | Reference database for single nucleotide polymorphisms. Provides standardised rsID identifiers and variant descriptions. | Continuous |
| UniProt | EBI / SIB / PIR | Protein sequence and functional information. Used for assessing the impact of missense variants on protein function. | Monthly |
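Conceptually, the cross-referencing step is a keyed lookup across several indexes. The sketch below assumes each database has already been loaded into an in-memory dictionary keyed by rsID; the `Evidence` type, the `lookup` function, and the placeholder identifiers are hypothetical and do not reflect any database's real API.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str   # name of the database that returned the hit
    summary: str  # the annotation stored for this variant


def lookup(rsid: str, sources: dict[str, dict[str, str]]) -> list[Evidence]:
    """Collect every annotation for one rsID across all loaded sources."""
    hits = []
    for name, index in sources.items():
        if rsid in index:
            hits.append(Evidence(source=name, summary=index[rsid]))
    return hits


# Toy indexes with placeholder identifiers, for illustration only
toy_sources = {
    "ClinVar": {"rsEXAMPLE1": "placeholder classification"},
    "GWAS Catalog": {"rsEXAMPLE1": "placeholder trait association"},
}
```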
Stage 3: Pharmacogenomic profiling
For pharmacogenomics analysis, DeepDNA follows established clinical guidelines:
- Star allele calling — Pharmacogenes (CYP2D6, CYP2C19, CYP2C9, CYP3A5, DPYD, TPMT, NUDT15, SLCO1B1, and others) are assigned star alleles based on the combination of detected variants, following PharmVar (Pharmacogene Variation Consortium) nomenclature
- Diplotype determination — Two alleles are resolved into a diplotype representing both inherited copies of the gene
- Metaboliser phenotype assignment — Diplotypes are translated into metaboliser phenotypes (ultrarapid, normal/extensive, intermediate, poor) using CPIC activity score tables
- Drug-gene interaction mapping — Metaboliser phenotypes are mapped to actionable drug recommendations using CPIC Level A and B guidelines, supplemented by DPWG (Dutch Pharmacogenetics Working Group) recommendations where applicable
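The diplotype-to-phenotype step can be illustrated for CYP2D6, where each star allele carries an activity value and the two values are summed. This is a sketch under stated assumptions: the activity values and score thresholds below are illustrative and should be checked against the current CPIC tables, and the function names are hypothetical. Gene duplications, which are needed for most ultrarapid calls, are not modelled.

```python
# Illustrative activity values for a few CYP2D6 star alleles (verify against CPIC)
ACTIVITY = {"*1": 1.0, "*2": 1.0, "*4": 0.0, "*10": 0.25}


def activity_score(diplotype: str) -> float:
    """Sum the activity values of the two alleles in a diplotype such as '*1/*4'."""
    first, second = diplotype.split("/")
    return ACTIVITY[first] + ACTIVITY[second]


def cyp2d6_phenotype(score: float) -> str:
    """Translate an activity score into a metaboliser phenotype (assumed thresholds)."""
    if score == 0:
        return "poor"
    if score < 1.25:
        return "intermediate"
    if score <= 2.25:
        return "normal"
    return "ultrarapid"
```

For example, `cyp2d6_phenotype(activity_score("*1/*4"))` returns `"intermediate"` under these assumed values.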
Stage 4: Polygenic risk score computation
For multifactorial conditions (type 2 diabetes, coronary artery disease, breast cancer, Alzheimer's disease, and others), DeepDNA computes polygenic risk scores (PRS):
- Variant selection — Risk-associated variants are selected from peer-reviewed GWAS with genome-wide significance (p < 5 × 10⁻⁸) and replicated across independent cohorts
- Effect weight application — Each variant is weighted by its published effect size (odds ratio or beta coefficient) from the largest available meta-analysis
- Score aggregation — Weighted effects are summed across all available variants to produce a raw polygenic score
- Population percentile mapping — Raw scores are normalised against reference population distributions to provide a percentile ranking (e.g., "your score places you in the 78th percentile")
- Ancestry adjustment — Where sufficient data is available, PRS are adjusted for ancestral background using principal component analysis to reduce population stratification bias
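The aggregation and percentile steps above can be sketched as a weighted sum followed by a lookup in a reference distribution. The sketch assumes weights are on the log-odds scale (the natural log of the odds ratio, or the beta coefficient), that dosages count effect alleles (0, 1, or 2), and that the reference distribution is approximately normal. The function names and the normality assumption are ours for illustration; ancestry adjustment is not shown.

```python
from statistics import NormalDist


def polygenic_score(dosages: dict[str, int], weights: dict[str, float]) -> float:
    """Sum weight x effect-allele dosage over variants present in both inputs."""
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


def percentile(raw: float, ref_mean: float, ref_sd: float) -> float:
    """Map a raw score to a percentile, assuming a normal reference distribution."""
    return 100 * NormalDist(ref_mean, ref_sd).cdf(raw)
```

Variants in the weight table that are absent from the genotype file are skipped here; a production pipeline would need to track that missingness, because it changes how the raw score compares with the reference distribution.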
Stage 5: Nutrigenomic analysis
Nutrigenomic findings are derived from variants in genes affecting nutrient metabolism, absorption, and utilisation:
- MTHFR — Folate metabolism and methylation cycle efficiency
- VDR — Vitamin D receptor variants affecting calcium and vitamin D status
- LCT — Lactase persistence and lactose tolerance
- FTO — Fat mass and obesity-associated gene variants
- FADS1/FADS2 — Omega-3 and omega-6 fatty acid desaturase activity
- HFE — Iron absorption and haemochromatosis risk
- CYP1A2 — Caffeine metabolism rate
- ADH1B/ALDH2 — Alcohol metabolism efficiency
Each nutrigenomic finding is linked to its source publication, with practical dietary recommendations based on the genotype result.
Stage 6: Ancestry inference
Ancestry analysis uses a principal component analysis (PCA) approach applied to ancestry-informative markers (AIMs):
- A curated panel of thousands of ancestry-informative SNPs is extracted from the user's genotype data
- Principal components are computed and projected onto reference population clusters derived from the 1000 Genomes Project and HGDP (Human Genome Diversity Project)
- Admixture proportions are estimated using maximum likelihood methods to provide continental and sub-continental ancestry percentages
- Haplogroup assignment is performed for mitochondrial DNA (maternal lineage) and Y-chromosome (paternal lineage, where data is available) markers
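The projection step can be sketched as follows, assuming the principal-component loadings and per-marker means were computed beforehand from the reference panel. Nearest-centroid assignment is used here as a deliberately simple stand-in for the maximum likelihood admixture estimation described above; it yields a single closest cluster, not ancestry percentages. All names and numbers are hypothetical.

```python
import math


def project(genotype: list[float], mean: list[float],
            loadings: list[list[float]]) -> list[float]:
    """Centre a dosage vector and project it onto precomputed PC loadings."""
    centred = [g - m for g, m in zip(genotype, mean)]
    return [sum(c * w for c, w in zip(centred, pc)) for pc in loadings]


def nearest_cluster(point: list[float], centroids: dict[str, list[float]]) -> str:
    """Return the reference cluster whose centroid is closest in PC space."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))
```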
4. AI model architecture
DeepDNA's AI layer operates on top of the deterministic variant interpretation pipeline. The AI components serve two functions:
Variant effect prediction
For variants of uncertain significance (VUS), and for variants that have no classification in ClinVar or other curated databases, our AI models predict functional impact using:
- Protein structure context — Leveraging AlphaFold-predicted protein structures to assess whether amino acid changes occur in functional domains, binding sites, or structurally critical regions
- Evolutionary conservation — Multi-species sequence alignment scores (phyloP, phastCons) indicating positions under selective pressure
- Ensemble pathogenicity scoring — Aggregation of multiple in-silico predictors (CADD, REVEL, MetaRNN) into a consensus pathogenicity estimate
- Functional domain analysis — Annotation of protein domains (InterPro, Pfam) to contextualise variant location within known functional elements
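One simple way to combine predictors that report on different scales is to rescale each to the range 0 to 1 and average the ones that are available. The sketch below does exactly that; the rescaling rule for CADD (PHRED score capped at 40, then divided by 40) and the unweighted mean are assumptions for illustration, not a description of DeepDNA's actual ensemble.

```python
from typing import Optional

# Assumed rescaling to [0, 1]; REVEL and MetaRNN already report scores in that range
RESCALE = {
    "CADD": lambda s: min(s, 40.0) / 40.0,
    "REVEL": lambda s: s,
    "MetaRNN": lambda s: s,
}


def consensus_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Average the rescaled scores of whichever known predictors are present."""
    scaled = [
        RESCALE[name](value)
        for name, value in scores.items()
        if name in RESCALE and value is not None
    ]
    if not scaled:
        return None  # no usable predictor output for this variant
    return sum(scaled) / len(scaled)
```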
Report synthesis and natural language generation
The second AI layer synthesises the structured variant data into readable, personalised report narratives:
- Technical variant annotations are translated into plain-language explanations
- Related findings across health, pharmacogenomics, and nutrigenomics are cross-referenced to surface interconnections
- Actionable recommendations are generated based on the aggregate genetic profile
- All generated text is constrained by a medical-safety guardrail system that prevents overstating findings, making diagnostic claims, or recommending specific treatments
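A very small part of such a guardrail can be pictured as a phrase filter applied to generated text. The pattern list and function name below are hypothetical examples; a production guardrail would need far more than pattern matching.

```python
import re

# Hypothetical examples of wording a report should never contain
BANNED = [
    r"\byou will develop\b",
    r"\byou have been diagnosed\b",
    r"\byou should (start|stop) taking\b",
]
_PATTERNS = [re.compile(p, re.IGNORECASE) for p in BANNED]


def guardrail_violations(text: str) -> list[str]:
    """Return the banned patterns that match the generated text."""
    return [p.pattern for p in _PATTERNS if p.search(text)]
```

Text that triggers any pattern could then be regenerated or routed for review rather than shown to the user.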
5. Evidence grading system
Every finding in a DeepDNA report is assigned an evidence grade that reflects the strength of the underlying scientific evidence:
| Grade | Label | Criteria |
|---|---|---|
| A | Established | Supported by multiple large-scale studies, meta-analyses, or clinical guidelines (CPIC Level A, ClinVar pathogenic with expert review). Actionable with high confidence. |
| B | Strong evidence | Supported by replicated GWAS findings (genome-wide significance in independent cohorts), CPIC Level B guidelines, or ClinVar pathogenic/likely pathogenic with multiple submitters. |
| C | Moderate evidence | Supported by published studies with consistent findings but limited replication, single large GWAS, or ClinVar likely pathogenic with single submitter. |
| D | Preliminary | Based on early-stage research, small sample sizes, or AI-predicted functional impact. Included for informational context but should not drive health decisions without further validation. |
The evidence grade is displayed alongside every finding in the report. We encourage users to share Grade A and B findings with their healthcare providers and to treat Grade D findings as exploratory.
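The grading table can be read as an ordered set of rules, checked from strongest to weakest. The sketch below encodes a simplified subset of the criteria; the `Support` fields and the `grade` function are hypothetical, and criteria such as "multiple large-scale studies" are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Support:
    cpic_level: Optional[str] = None   # "A" or "B"
    clinvar: Optional[str] = None      # "pathogenic" or "likely_pathogenic"
    clinvar_expert_review: bool = False
    clinvar_submitters: int = 0
    replicated_gwas: bool = False
    single_large_gwas: bool = False


def grade(s: Support) -> str:
    """Assign an evidence grade, checking the strongest criteria first."""
    if s.cpic_level == "A" or (s.clinvar == "pathogenic" and s.clinvar_expert_review):
        return "A"
    if (s.cpic_level == "B" or s.replicated_gwas
            or (s.clinvar in {"pathogenic", "likely_pathogenic"}
                and s.clinvar_submitters >= 2)):
        return "B"
    if s.single_large_gwas or (s.clinvar == "likely_pathogenic"
                               and s.clinvar_submitters == 1):
        return "C"
    return "D"  # preliminary: early-stage research or AI-predicted impact
```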
6. Quality standards and validation
Internal validation
Our variant interpretation pipeline is validated through:
- Concordance testing — Results are compared against published clinical-grade reports for well-characterised reference samples (e.g., NA12878 / HG001 from the Genome in a Bottle Consortium)
- Pharmacogenomic benchmarking — Star allele calls are validated against the GeT-RM (Genetic Testing Reference Materials) programme at the CDC
- PRS calibration — Polygenic risk scores are calibrated against known population distributions using data from the UK Biobank and other large cohorts
- Regression testing — Every pipeline update is tested against a suite of reference genotype files to ensure no regressions in variant calls or report content
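Concordance against a reference sample can be sketched as the share of overlapping variants whose genotype calls agree. The function below is an illustrative sketch with a hypothetical name; it compares genotypes as unordered allele pairs and ignores strand and build differences, which a real comparison would have to resolve first.

```python
def concordance(calls: dict[str, str], truth: dict[str, str]) -> float:
    """Fraction of shared variants whose genotypes match, ignoring allele order."""
    shared = calls.keys() & truth.keys()
    if not shared:
        return 0.0
    # sorted() makes "AG" and "GA" compare equal
    matches = sum(1 for rsid in shared if sorted(calls[rsid]) == sorted(truth[rsid]))
    return matches / len(shared)
```

The same function can serve as a regression check: run it on the output of the old and new pipeline versions for a reference file and require a value of 1.0.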
Database currency
Clinical databases are updated on a regular schedule:
- ClinVar and GWAS Catalog annotations are synchronised at least monthly
- PharmGKB and CPIC guidelines are updated within 30 days of new guideline publications
- gnomAD and population frequency data are updated with each major release
Limitations and known constraints
We are transparent about the limitations of our approach:
- Genotype arrays are not whole-genome sequencing. Consumer DNA files typically contain 600,000 to 900,000 SNPs, a small subset of the roughly 4–5 million sites at which a typical human genome differs from the reference. Some clinically relevant variants may not be present in the input file.
- Structural variants are not captured. Copy number variations (CNVs), large insertions/deletions, and chromosomal rearrangements cannot be reliably detected from microarray genotype data.
- Ancestry bias in GWAS. The majority of published GWAS have been conducted in populations of European ancestry. Polygenic risk scores may have reduced predictive accuracy for individuals of non-European ancestry. We disclose this limitation in every PRS finding.
- Gene-environment interaction. Genetic risk is one component of overall health risk. Environmental factors, lifestyle, and epigenetics are not captured by genotype analysis alone. Our reports include contextual information about modifiable risk factors where relevant.
- Evolving science. Genomic research is advancing rapidly. Variant classifications may change as new evidence emerges. DeepDNA reports reflect the state of knowledge at the time of analysis.
7. Ethical standards
- No diagnostic claims. DeepDNA reports are for educational and informational purposes only. We do not diagnose medical conditions.
- No deterministic language. We use probabilistic language ("increased risk," "may affect") rather than deterministic statements ("you will develop"). Genetic risk is a forecast, not a sentence.
- Incidental findings policy. If our analysis detects a variant classified as pathogenic for a serious, actionable condition (as defined by the ACMG Secondary Findings list, SF v3.2), we include a clear, sensitive notification in the report with a strong recommendation to consult a clinical geneticist.
- Privacy-first processing. Genotype files are processed in memory, used to generate the report, and then permanently deleted. We do not retain, aggregate, or analyse genetic data across users. See our Privacy Policy for full details.
- No data monetisation. We do not sell, license, or share genetic data or analysis results with any third party, including pharmaceutical companies, insurance providers, or research institutions.
8. References and further reading
Key publications and resources underpinning our methodology:
- Landrum MJ et al. "ClinVar: improvements to accessing data." Nucleic Acids Research, 2020. doi:10.1093/nar/gkz972
- Whirl-Carrillo M et al. "An evidence-based framework for evaluating pharmacogenomics knowledge for personalized medicine." Clinical Pharmacology & Therapeutics, 2021. doi:10.1002/cpt.2350
- Karczewski KJ et al. "The mutational constraint spectrum quantified from variation in 141,456 humans." Nature, 2020. doi:10.1038/s41586-020-2308-7
- Swen JJ et al. "A 12-gene pharmacogenetic panel to prevent adverse drug reactions: an open-label, multicentre, controlled, cluster-randomised crossover implementation study." The Lancet, 2023. doi:10.1016/S0140-6736(22)01841-4 (PREPARE trial)
- Choi SW et al. "Tutorial: a guide to performing polygenic risk score analyses." Nature Protocols, 2020. doi:10.1038/s41596-020-0353-1
- Jumper J et al. "Highly accurate protein structure prediction with AlphaFold." Nature, 2021. doi:10.1038/s41586-021-03819-2
- Miller DT et al. "ACMG SF v3.2 list for reporting of secondary findings in clinical exome and genome sequencing." Genetics in Medicine, 2023. doi:10.1016/j.gim.2023.04.006
- Buniello A et al. "The NHGRI-EBI GWAS Catalog of published genome-wide association studies." Nucleic Acids Research, 2019. doi:10.1093/nar/gky1120
9. Contact
For scientific questions, methodology feedback, or requests for additional technical documentation:
- Email: [email protected]
For general enquiries:
- Email: [email protected]