Methodology | DeepDNA — Scientific Approach to Genomic Analysis

  Our commitment: Every finding in a DeepDNA report is grounded in peer-reviewed research, validated against clinical-grade databases, and graded by evidence level. We cite our sources, explain our confidence, and never overstate what the science says.

1. Overview

DeepDNA is an AI-powered genomic analysis platform that interprets existing DNA genotype files. We do not perform laboratory sequencing. Instead, we accept raw genotype data files exported from consumer DNA testing providers (23andMe, AncestryDNA, MyHeritage, and others) and apply a multi-layered analysis pipeline to generate a comprehensive health, pharmacogenomics, nutrigenomics, and ancestry report.

Our methodology is built on three principles:

Clinical-grade data sources — We reference the same databases used in clinical genetics laboratories and academic research
Transparent evidence grading — Every finding is classified by evidence strength so users and their healthcare providers can assess its relevance
Reproducibility — Given the same input file and the same database versions, our pipeline produces the same findings. Analysis is deterministic at the variant-interpretation layer

2. Input data and file processing

Supported file formats

DeepDNA accepts raw genotype data files in the following formats:

23andMe — Tab-separated text files containing rsID, chromosome, position, and genotype
AncestryDNA — Tab-separated text files with similar variant annotations
MyHeritage — CSV-formatted genotype exports
VCF (Variant Call Format) — Standard bioinformatics format used by clinical laboratories and whole-genome sequencing providers
Other providers — Any file following standard genotype export conventions (rsID + genotype mapping)
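
The rsID + genotype mapping shared by these formats can be illustrated with a minimal Python sketch. It assumes the four-column tab-separated layout described above for 23andMe; the function name `parse_genotype_lines` and the sample rsIDs are made up for illustration and do not represent DeepDNA's actual parser.

```python
from typing import Dict, Iterable


def parse_genotype_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse rows of (rsID, chromosome, position, genotype) into {rsID: genotype}."""
    genotypes: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # Malformed row: ignored in this sketch.
        rsid, _chromosome, _position, genotype = fields[:4]
        genotypes[rsid] = genotype
    return genotypes


# Made-up rows for illustration only.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAG",
    "rs0000002\t2\t2000\tCC",
]
parsed = parse_genotype_lines(sample)
```

A production parser would also need to handle provider-specific headers and the two-allele or CSV layouts used by other exports.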

File validation

Before analysis begins, each uploaded file undergoes automated validation:

Format detection

The system identifies the file format and source provider automatically. Headers, delimiters, and encoding are verified against known schemas for each provider.
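
A rough sketch of how delimiter- and header-based detection might work is shown below. The labels returned, the column-count heuristics, and the function name `detect_format` are assumptions made for illustration; only the `##fileformat=VCF` first line comes from the VCF specification itself.

```python
def detect_format(text: str) -> str:
    """Return a coarse format label for a genotype export, or 'unknown'."""
    lines = text.splitlines()
    # The VCF specification requires the first line to declare the file format.
    if lines and lines[0].startswith("##fileformat=VCF"):
        return "vcf"
    for line in lines:
        # Skip blank lines and '#' comment lines to reach the first content row.
        if not line.strip() or line.startswith("#"):
            continue
        if "\t" in line:
            columns = len(line.split("\t"))
            if columns == 4:
                return "tsv_single_genotype_column"
            if columns == 5:
                # Assumed layout: the two alleles in separate columns.
                return "tsv_split_allele_columns"
            return "unknown"
        if "," in line:
            return "csv"
        return "unknown"
    return "unknown"
```

Only the first content row is inspected here; verifying encoding and full provider schemas, as described above, would require additional checks.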

Integrity checks

The file is checked for completeness, duplicate entries, and data consistency. Minimum variant count thresholds ensure the file contains sufficient data for meaningful analysis.
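
As an illustration of these checks, the sketch below counts duplicate rsIDs, flags rsIDs that appear with conflicting genotypes, and applies a minimum variant count. The threshold value, the report keys, and the function name `check_integrity` are hypothetical; the actual thresholds are not stated in this document.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

MIN_VARIANTS = 3  # Hypothetical threshold, kept tiny for this demo.


def check_integrity(rows: List[Tuple[str, str]],
                    min_variants: int = MIN_VARIANTS) -> Dict[str, object]:
    """Summarise duplicates, conflicting calls, and variant count for (rsID, genotype) rows."""
    genotypes_seen: Dict[str, Set[str]] = defaultdict(set)
    occurrences: Dict[str, int] = defaultdict(int)
    for rsid, genotype in rows:
        genotypes_seen[rsid].add(genotype)
        occurrences[rsid] += 1
    return {
        "unique_variants": len(genotypes_seen),
        # rsIDs listed more than once.
        "duplicates": sorted(r for r, n in occurrences.items() if n > 1),
        # rsIDs listed more than once with different genotypes.
        "conflicts": sorted(r for r, g in genotypes_seen.items() if len(g) > 1),
        "enough_variants": len(genotypes_seen) >= min_variants,
    }


# Made-up rows for illustration only.
rows = [
    ("rs0000001", "AG"),
    ("rs0000002", "CC"),
    ("rs0000002", "CC"),
    ("rs0000003", "TT"),
    ("rs0000003", "CT"),
]
report = check_integrity(rows)
```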

Genome build alignment

Variant positions are mapped to the GRCh37/hg19 reference genome assembly using coordinate liftover where necessary. This ensures consistent annotation regardless of the original provider's build version.

Quality scoring

A file quality score is computed based on call rate, variant coverage across key genomic regions, and concordance with expected population allele frequencies. Files below quality thresholds are flagged for the user.
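
Call rate, one of the inputs named above, is the fraction of variants with a usable genotype. The sketch below assumes that no-calls are encoded as `--` or `00`, which is an assumption about provider conventions; how DeepDNA combines call rate with the other inputs into a single score is not specified here, so only the call-rate component is shown.

```python
from typing import Dict

# Assumed no-call encodings; real exports may use others.
NO_CALL_CODES = {"--", "00", ""}


def call_rate(genotypes: Dict[str, str]) -> float:
    """Fraction of variants with a called genotype (0.0 for an empty file)."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes.values() if g not in NO_CALL_CODES)
    return called / len(genotypes)


# Made-up genotypes for illustration only.
example = {"rs0000001": "AG", "rs0000002": "--", "rs0000003": "CC", "rs0000004": "TT"}
rate = call_rate(example)
```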

3. Variant interpretation pipeline

The core of DeepDNA's analysis is a multi-layered variant interpretation pipeline that processes each genetic variant through several stages of annotation, classification, and synthesis.

Stage 1: Variant annotation

Each variant (SNP) in the genotype file is annotated with:

Gene mapping — The gene (or intergenic region) where the variant is located
Functional consequence — Whether the variant is missense, synonymous, intronic, regulatory, or intergenic
Population frequency — Allele frequency data from gnomAD (Genome Aggregation Database), stratified by ancestry
Conservation scores — Evolutionary conservation metrics indicating functional importance

Stage 2: Clinical significance lookup

Annotated variants are cross-referenced against multiple curated databases to determine clinical relevance:

Database	Maintained by	Content	Update frequency
ClinVar	NCBI / NIH (United States)	Variant-disease associations submitted by clinical laboratories, research groups, and expert panels worldwide. Over 2.5 million submissions covering pathogenic, benign, and uncertain significance classifications.	Weekly
PharmGKB	Stanford University	Pharmacogenomic variant annotations: how genetic variants affect drug metabolism, efficacy, and adverse reactions. Includes CPIC (Clinical Pharmacogenetics Implementation Consortium) guidelines.	Continuous
gnomAD	Broad Institute of MIT and Harvard	Population allele frequencies from 807,162 individuals across multiple ancestries. Essential for distinguishing rare pathogenic variants from common benign polymorphisms.	Major releases
OMIM	Johns Hopkins University	Comprehensive compendium of human genes and genetic phenotypes. Mendelian disorder associations, gene-phenotype relationships.	Daily
GWAS Catalog	NHGRI-EBI	Curated collection of published genome-wide association studies. Variant-trait associations with effect sizes and p-values from studies meeting strict quality thresholds.	Weekly
dbSNP	NCBI / NIH	Reference database for single nucleotide polymorphisms. Provides standardised rsID identifiers and variant descriptions.	Continuous
UniProt	EBI / SIB / PIR	Protein sequence and functional information. Used for assessing the impact of missense variants on protein function.	Monthly

Stage 3: Pharmacogenomic profiling

For pharmacogenomics analysis, DeepDNA follows established clinical guidelines:

Star allele calling — Pharmacogenes (CYP2D6, CYP2C19, CYP2C9, CYP3A5, DPYD, TPMT, NUDT15, SLCO1B1, and others) are assigned star alleles based on the combination of detected variants, following PharmVar (Pharmacogene Variation Consortium) nomenclature
Diplotype determination — Two alleles are resolved into a diplotype representing both inherited copies of the gene
Metaboliser phenotype assignment — Diplotypes are translated into metaboliser phenotypes (ultrarapid, normal/extensive, intermediate, poor) using CPIC activity score tables
Drug-gene interaction mapping — Metaboliser phenotypes are mapped to actionable drug recommendations using CPIC Level A and B guidelines, supplemented by DPWG (Dutch Pharmacogenetics Working Group) recommendations where applicable
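
The diplotype-to-phenotype step can be sketched as follows. The allele activity values and score thresholds are illustrative, modelled on the CYP2D6 activity score approach; they should be checked against the current CPIC tables before any real use. The names `ALLELE_ACTIVITY`, `activity_score`, and `metaboliser_phenotype` are hypothetical.

```python
from typing import Dict

# Illustrative activity values only; authoritative values come from CPIC tables.
ALLELE_ACTIVITY: Dict[str, float] = {"*1": 1.0, "*4": 0.0, "*10": 0.25}


def activity_score(allele_a: str, allele_b: str) -> float:
    """Sum the activity values of the two alleles in a diplotype."""
    return ALLELE_ACTIVITY[allele_a] + ALLELE_ACTIVITY[allele_b]


def metaboliser_phenotype(score: float) -> str:
    """Map an activity score to a phenotype using assumed CYP2D6-style thresholds."""
    if score == 0:
        return "poor"
    if score < 1.25:
        return "intermediate"
    if score <= 2.25:
        return "normal"
    return "ultrarapid"


phenotype = metaboliser_phenotype(activity_score("*1", "*4"))
```

Not every pharmacogene uses activity scores, and scores above the normal range generally arise from gene duplications, which the clinical note below explains may not be detectable from array data.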

  Clinical note: DeepDNA pharmacogenomic results are for informational purposes only. Genotype arrays do not capture all pharmacogenomic variants (e.g., CYP2D6 gene deletions or duplications may not be detectable from microarray data). Always consult a healthcare provider before making medication changes.

Stage 4: Polygenic risk score computation

For multifactorial conditions (type 2 diabetes, coronary artery disease, breast cancer, Alzheimer's disease, and others), DeepDNA computes polygenic risk scores (PRS):

Variant selection — Risk-associated variants are selected from peer-reviewed GWAS with genome-wide significance (p < 5 × 10^-8) and replicated across independent cohorts
Effect weight application — Each variant is weighted by its published effect size (log odds ratio or beta coefficient) from the largest available meta-analysis
Score aggregation — Weighted effects are summed across all available variants to produce a raw polygenic score
Population percentile mapping — Raw scores are normalised against reference population distributions to provide a percentile ranking (e.g., "your score places you in the 78th percentile")
Ancestry adjustment — Where sufficient data is available, PRS are adjusted for ancestral background using principal component analysis to reduce population stratification bias
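
The weighting, aggregation, and percentile steps above can be sketched in a few lines of Python. The sketch assumes that weights are per-allele betas (a published odds ratio would first be converted with `math.log`), that missing variants are skipped, and that the reference distribution is approximately normal; the rsIDs, weights, and reference mean and standard deviation are made up for illustration. Ancestry adjustment is not shown.

```python
from statistics import NormalDist
from typing import Dict, Tuple


def raw_prs(genotypes: Dict[str, str], weights: Dict[str, Tuple[str, float]]) -> float:
    """Sum beta * risk-allele dosage over variants present in the genotype file."""
    score = 0.0
    for rsid, (risk_allele, beta) in weights.items():
        genotype = genotypes.get(rsid)
        if genotype is None:
            continue  # Variant not on the array: skipped in this sketch.
        dosage = genotype.count(risk_allele)  # 0, 1, or 2 copies.
        score += beta * dosage
    return score


def percentile(raw: float, ref_mean: float, ref_sd: float) -> float:
    """Percentile of a raw score under an assumed normal reference distribution."""
    return 100.0 * NormalDist(ref_mean, ref_sd).cdf(raw)


# Made-up genotypes and weights for illustration only.
genotypes = {"rs0000001": "AG", "rs0000002": "TT"}
weights = {
    "rs0000001": ("A", 0.10),
    "rs0000002": ("T", 0.05),
    "rs0000003": ("C", 0.20),  # Absent from the file, so it contributes nothing.
}
score = raw_prs(genotypes, weights)
```

Skipping missing variants is the simplest choice; it biases scores downward when many risk variants are absent from the array, so imputation or a coverage threshold would be needed in practice.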

Stage 5: Nutrigenomic analysis

Nutrigenomic findings are derived from variants in genes affecting nutrient metabolism, absorption, and utilisation:

MTHFR — Folate metabolism and methylation cycle efficiency
VDR — Vitamin D receptor variants affecting calcium and vitamin D status
LCT — Lactase persistence and lactose tolerance
FTO — Fat mass and obesity-associated gene variants
FADS1/FADS2 — Omega-3 and omega-6 fatty acid desaturase activity
HFE — Iron absorption and haemochromatosis risk
CYP1A2 — Caffeine metabolism rate
ADH1B/ALDH2 — Alcohol metabolism efficiency

Each nutrigenomic finding is linked to its source publication, with practical dietary recommendations based on the genotype result.

Stage 6: Ancestry inference

Ancestry analysis uses a principal component analysis (PCA) approach applied to ancestry-informative markers (AIMs):

A curated panel of thousands of ancestry-informative SNPs is extracted from the user's genotype data
Principal components are computed and projected onto reference population clusters derived from the 1000 Genomes Project and HGDP (Human Genome Diversity Project)
Admixture proportions are estimated using maximum likelihood methods to provide continental and sub-continental ancestry percentages
Haplogroup assignment is performed for mitochondrial DNA (maternal lineage) and Y-chromosome (paternal lineage, where data is available) markers
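
The projection step can be illustrated with a simplified sketch. It assumes that per-marker means and principal component loadings have already been computed from a reference panel, and it assigns the sample to the nearest reference cluster centroid. Nearest-centroid assignment is a stand-in chosen for brevity; it is not the maximum likelihood admixture estimation described above. All names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def project(dosages: Sequence[float], means: Sequence[float],
            loadings: Sequence[Sequence[float]]) -> List[float]:
    """Centre an allele-dosage vector and project it onto each principal component."""
    centred = [d - m for d, m in zip(dosages, means)]
    return [sum(c * w for c, w in zip(centred, pc)) for pc in loadings]


def nearest_cluster(point: Sequence[float],
                    centroids: Dict[str, Sequence[float]]) -> str:
    """Return the label of the reference centroid closest to the projected point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Made-up reference values for three markers and two components.
means = [1.0, 1.0, 1.0]
loadings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
centroids = {"cluster_a": [1.0, -1.0], "cluster_b": [-1.0, 1.0]}

coordinates = project([2.0, 0.0, 1.0], means, loadings)
label = nearest_cluster(coordinates, centroids)
```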

4. AI model architecture

DeepDNA's AI layer operates on top of the deterministic variant interpretation pipeline. The AI components serve two functions:

Variant effect prediction

For variants that lack an established clinical classification, whether reported as variants of uncertain significance (VUS) or absent from ClinVar and other curated databases, our AI models predict functional impact using:

Protein structure context — Leveraging AlphaFold-predicted protein structures to assess whether amino acid changes occur in functional domains, binding sites, or structurally critical regions
Evolutionary conservation — Multi-species sequence alignment scores (phyloP, phastCons) indicating positions under selective pressure
Ensemble pathogenicity scoring — Aggregation of multiple in-silico predictors (CADD, REVEL, MetaRNN) into a consensus pathogenicity estimate
Functional domain analysis — Annotation of protein domains (InterPro, Pfam) to contextualise variant location within known functional elements
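
One simple way to aggregate predictors is sketched below. It assumes that REVEL and MetaRNN scores already lie between 0 and 1, that CADD is supplied as a PHRED-scaled score rescaled with a hypothetical cap of 40, and that the consensus is an unweighted mean. The actual aggregation method is not specified in this document, so each of these choices is an assumption.

```python
from typing import Dict

CADD_PHRED_CAP = 40.0  # Hypothetical cap used only to rescale CADD to 0-1.


def rescale(predictor: str, value: float) -> float:
    """Bring a predictor score onto a 0-1 scale."""
    if predictor == "CADD_phred":
        return min(max(value, 0.0), CADD_PHRED_CAP) / CADD_PHRED_CAP
    return min(max(value, 0.0), 1.0)


def consensus_score(scores: Dict[str, float]) -> float:
    """Unweighted mean of the rescaled scores that are available."""
    if not scores:
        raise ValueError("no predictor scores available")
    rescaled = [rescale(name, value) for name, value in scores.items()]
    return sum(rescaled) / len(rescaled)


# Made-up scores for illustration only.
estimate = consensus_score({"CADD_phred": 20.0, "REVEL": 0.5, "MetaRNN": 0.8})
```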

  Transparency principle: When a finding is based on AI prediction rather than established clinical classification, it is always clearly labelled in the report as "AI-predicted" with an associated confidence score. We never present AI predictions as equivalent to clinically validated findings.

Report synthesis and natural language generation

The second AI layer synthesises the structured variant data into readable, personalised report narratives:

Technical variant annotations are translated into plain-language explanations
Related findings across health, pharmacogenomics, and nutrigenomics are cross-referenced to surface interconnections
Actionable recommendations are generated based on the aggregate genetic profile
All generated text is constrained by a medical-safety guardrail system that prevents overstating findings, making diagnostic claims, or recommending specific treatments

5. Evidence grading system

Every finding in a DeepDNA report is assigned an evidence grade that reflects the strength of the underlying scientific evidence:

Grade	Label	Criteria
A	Established	Supported by multiple large-scale studies, meta-analyses, or clinical guidelines (CPIC Level A, ClinVar pathogenic with expert review). Actionable with high confidence.
B	Strong evidence	Supported by replicated GWAS findings (genome-wide significance in independent cohorts), CPIC Level B guidelines, or ClinVar pathogenic/likely pathogenic with multiple submitters.
C	Moderate evidence	Supported by published studies with consistent findings but limited replication, single large GWAS, or ClinVar likely pathogenic with single submitter.
D	Preliminary	Based on early-stage research, small sample sizes, or AI-predicted functional impact. Included for informational context but should not drive health decisions without further validation.

The evidence grade is displayed alongside every finding in the report. We encourage users to share Grade A and B findings with their healthcare providers and to treat Grade D findings as exploratory.
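
The grading table can be approximated as a rule cascade. The sketch below reduces each finding to a handful of hypothetical fields and checks the criteria from Grade A downward; combinations the table does not cover fall through to Grade D, which is an assumption of this sketch and not a stated policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Simplified, hypothetical summary of the evidence behind one finding."""
    cpic_level: Optional[str] = None      # "A", "B", or None
    clinvar_class: Optional[str] = None   # "pathogenic", "likely_pathogenic", or None
    clinvar_expert_review: bool = False
    clinvar_submitters: int = 0
    gwas_replicated: bool = False
    gwas_single_large: bool = False


def assign_grade(e: Evidence) -> str:
    """Return the first grade, from A to D, whose criteria the evidence meets."""
    if e.cpic_level == "A" or (e.clinvar_class == "pathogenic" and e.clinvar_expert_review):
        return "A"
    multi_submitter = (e.clinvar_class in ("pathogenic", "likely_pathogenic")
                       and e.clinvar_submitters > 1)
    if e.cpic_level == "B" or e.gwas_replicated or multi_submitter:
        return "B"
    single_submitter_lp = (e.clinvar_class == "likely_pathogenic"
                           and e.clinvar_submitters == 1)
    if e.gwas_single_large or single_submitter_lp:
        return "C"
    return "D"  # Includes AI-predicted and early-stage findings.


grade = assign_grade(Evidence(clinvar_class="likely_pathogenic", clinvar_submitters=3))
```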

6. Quality standards and validation

Internal validation

Our variant interpretation pipeline is validated through:

Concordance testing — Results are compared against published clinical-grade reports for well-characterised reference samples (e.g., NA12878 / HG001 from the Genome in a Bottle Consortium)
Pharmacogenomic benchmarking — Star allele calls are validated against the GeT-RM (Genetic Testing Reference Materials) programme at the CDC
PRS calibration — Polygenic risk scores are calibrated against known population distributions using data from the UK Biobank and other large cohorts
Regression testing — Every pipeline update is tested against a suite of reference genotype files to ensure no regressions in variant calls or report content
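
Concordance testing of this kind reduces to comparing genotype calls at shared sites. A minimal sketch, assuming both call sets are keyed by rsID and that allele order within a genotype is not meaningful (so `AG` and `GA` match); the function name and sample data are hypothetical:

```python
from typing import Dict


def concordance(calls: Dict[str, str], truth: Dict[str, str]) -> float:
    """Fraction of shared variants whose genotypes agree, ignoring allele order."""
    shared = calls.keys() & truth.keys()
    if not shared:
        return 0.0
    matches = sum(1 for rsid in shared if sorted(calls[rsid]) == sorted(truth[rsid]))
    return matches / len(shared)


# Made-up call sets for illustration only.
pipeline_calls = {"rs0000001": "AG", "rs0000002": "CC", "rs0000003": "TT", "rs0000004": "GG"}
reference_calls = {"rs0000001": "GA", "rs0000002": "CC", "rs0000003": "CT", "rs0000005": "AA"}
rate = concordance(pipeline_calls, reference_calls)
```

A regression suite could assert that this rate stays at or above a fixed threshold for each reference file after every pipeline update.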

Database currency

Clinical databases are updated on a regular schedule:

ClinVar and GWAS Catalog annotations are synchronised at least monthly
PharmGKB and CPIC guidelines are updated within 30 days of new guideline publications
gnomAD and population frequency data are updated with each major release

Limitations and known constraints

We are transparent about the limitations of our approach:

Genotype arrays are not whole-genome sequencing. Consumer DNA files typically contain 600,000 to 900,000 SNPs, a small subset of the approximately 4–5 million sites at which a typical human genome differs from the reference. Some clinically relevant variants may not be present in the input file.
Structural variants are not captured. Copy number variations (CNVs), large insertions/deletions, and chromosomal rearrangements cannot be reliably detected from microarray genotype data.
Ancestry bias in GWAS. The majority of published GWAS have been conducted in populations of European ancestry. Polygenic risk scores may have reduced predictive accuracy for individuals of non-European ancestry. We disclose this limitation in every PRS finding.
Gene-environment interaction. Genetic risk is one component of overall health risk. Environmental factors, lifestyle, and epigenetics are not captured by genotype analysis alone. Our reports include contextual information about modifiable risk factors where relevant.
Evolving science. Genomic research is advancing rapidly. Variant classifications may change as new evidence emerges. DeepDNA reports reflect the state of knowledge at the time of analysis.

7. Ethical standards

No diagnostic claims. DeepDNA reports are for educational and informational purposes only. We do not diagnose medical conditions.
No deterministic language. We use probabilistic language ("increased risk," "may affect") rather than deterministic statements ("you will develop"). Genetic risk is a forecast, not a sentence.
Incidental findings policy. If our analysis detects a variant classified as pathogenic for a serious, actionable condition (as defined by the ACMG Secondary Findings list, SF v3.2), we include a clear, sensitive notification in the report with a strong recommendation to consult a clinical geneticist.
Privacy-first processing. Genotype files are processed in memory, used to generate the report, and then permanently deleted. We do not retain, aggregate, or analyse genetic data across users. See our Privacy Policy for full details.
No data monetisation. We do not sell, license, or share genetic data or analysis results with any third party, including pharmaceutical companies, insurance providers, or research institutions.

8. References and further reading

Key publications and resources underpinning our methodology:

Landrum MJ et al. "ClinVar: improvements to accessing data." Nucleic Acids Research, 2020. doi:10.1093/nar/gkz972
Whirl-Carrillo M et al. "An evidence-based framework for evaluating pharmacogenomics knowledge for personalized medicine." Clinical Pharmacology & Therapeutics, 2021. doi:10.1002/cpt.2350
Karczewski KJ et al. "The mutational constraint spectrum quantified from variation in 141,456 humans." Nature, 2020. doi:10.1038/s41586-020-2308-7
Swen JJ et al. "A 12-gene pharmacogenetic panel to prevent adverse drug reactions: an open-label, multicentre, controlled, cluster-randomised crossover implementation study." The Lancet, 2023. doi:10.1016/S0140-6736(22)01841-4 (PREPARE trial)
Choi SW et al. "Tutorial: a guide to performing polygenic risk score analyses." Nature Protocols, 2020. doi:10.1038/s41596-020-0353-1
Jumper J et al. "Highly accurate protein structure prediction with AlphaFold." Nature, 2021. doi:10.1038/s41586-021-03819-2
Miller DT et al. "ACMG SF v3.2 list for reporting of secondary findings in clinical exome and genome sequencing." Genetics in Medicine, 2023. doi:10.1016/j.gim.2023.04.006
Buniello A et al. "The NHGRI-EBI GWAS Catalog of published genome-wide association studies." Nucleic Acids Research, 2019. doi:10.1093/nar/gky1120

9. Contact

For scientific questions, methodology feedback, or requests for additional technical documentation:

Email: [email protected]

For general enquiries:

Email: [email protected]