1. The analysis pipeline
When you upload your genotype file, it passes through a six-stage pipeline. Each stage is fault-tolerant: if one annotation source is temporarily unavailable, the remaining sources are still queried and the pipeline does not block.
Upload & format detection
We accept raw data files from 23andMe (v3–v5), AncestryDNA (v1/v2), MyHeritage, FamilyTreeDNA, and standard VCF files. The parser auto-detects format, encoding, and genome build (GRCh37/GRCh38).
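A minimal sketch of how header-based format detection could work. It is not DeepDNA's actual parser: the function name `detect_format` and the returned labels are hypothetical. It assumes the commonly seen layouts of tab-separated 23andMe exports (four columns), AncestryDNA exports (five columns), and VCF files (a `##fileformat=VCF` first line). CSV-based exports such as MyHeritage, and genome build detection, are left out.

```python
def detect_format(lines):
    """Guess the file format from the first informative lines of a raw data file."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##fileformat=VCF"):
            return "vcf"
        if line.startswith("#"):
            # Vendor exports usually name themselves in the comment header.
            lowered = line.lower()
            if "ancestrydna" in lowered:
                return "ancestrydna"
            if "23andme" in lowered:
                return "23andme"
            continue
        # First non-comment line: fall back on the column count (assumption).
        columns = line.split("\t")
        if len(columns) == 5:
            return "ancestrydna"  # rsid, chromosome, position, allele1, allele2
        if len(columns) == 4:
            return "23andme"      # rsid, chromosome, position, genotype
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    print(detect_format(["##fileformat=VCFv4.2", "#CHROM\tPOS\tID"]))
```

Returning `"unknown"` rather than guessing lets the caller reject the upload early instead of mis-parsing it.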
Normalisation & quality control
Genotypes are standardised to canonical form, no-call variants are removed, duplicate entries are collapsed by rsID, and the full set is sorted by chromosome and position. Typical output: ~700,000–800,000 clean variants.
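The normalisation steps above can be sketched as follows. This is an illustration under stated assumptions, not the production code: the `Variant` type, the no-call markers (`--` and `00`), the "first occurrence wins" rule for duplicates, and alphabetical allele ordering as the canonical form are all assumptions.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    genotype: str


NO_CALLS = {"--", "00", ""}  # assumed no-call markers
CHROM_ORDER = {**{str(n): n for n in range(1, 23)}, "X": 23, "Y": 24, "MT": 25}


def normalise(variants):
    """Drop no-calls, keep the first record per rsID, sort by chromosome and position."""
    seen = set()
    clean = []
    for v in variants:
        genotype = v.genotype.upper()
        if genotype in NO_CALLS or v.rsid in seen:
            continue
        seen.add(v.rsid)
        # Canonical form (assumption): alleles in alphabetical order, so "GA" == "AG".
        clean.append(v._replace(genotype="".join(sorted(genotype))))
    # Unrecognised chromosome labels sort last.
    return sorted(clean, key=lambda v: (CHROM_ORDER.get(v.chrom, 99), v.pos))


if __name__ == "__main__":
    demo = [Variant("rs2", "2", 50, "GA"), Variant("rs1", "1", 200, "--")]
    print(normalise(demo))
```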
Concurrent annotation
Each variant is queried against multiple databases simultaneously using asynchronous I/O: ClinVar for clinical significance, PharmGKB for drug-gene interactions, GWAS Catalog for trait associations, gnomAD for population frequencies, and MyVariant.info for aggregated pathogenicity scores (CADD, SIFT, PolyPhen-2).
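One way to get both concurrency and fault tolerance is `asyncio.gather` with `return_exceptions=True`, so a failing source is recorded instead of cancelling the others. The sketch below uses stub coroutines in place of real database clients. `query_clinvar`, `query_gnomad`, and `annotate` are hypothetical names, and no network call is made.

```python
import asyncio


async def query_clinvar(rsid):
    # Stub standing in for a real ClinVar lookup.
    await asyncio.sleep(0)
    return {"source": "ClinVar", "rsid": rsid}


async def query_gnomad(rsid):
    # Stub that simulates a temporarily unavailable source.
    await asyncio.sleep(0)
    raise ConnectionError("gnomAD temporarily unavailable")


async def annotate(rsid, sources):
    """Query every source concurrently; collect successes and failures separately."""
    results = await asyncio.gather(
        *(fetch(rsid) for fetch in sources.values()),
        return_exceptions=True,  # one failing source does not abort the rest
    )
    annotations, failures = {}, {}
    for name, result in zip(sources, results):
        if isinstance(result, Exception):
            failures[name] = str(result)
        else:
            annotations[name] = result
    return annotations, failures


if __name__ == "__main__":
    sources = {"clinvar": query_clinvar, "gnomad": query_gnomad}
    print(asyncio.run(annotate("rs4149056", sources)))
```

Keeping failures in a separate mapping makes it possible to retry a single source later or to note in the report that it was unavailable.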
Polygenic risk scoring
We calculate weighted polygenic risk scores (PRS) for multiple conditions: Type 2 Diabetes, Coronary Artery Disease, Alzheimer’s Disease, Breast Cancer, and BMI/Obesity. Each score is normalised to a population-calibrated percentile (0–100) and categorised as low, average, elevated, or high risk.
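As an illustration of the arithmetic, a weighted PRS is the sum of per-SNP effect weights multiplied by the number of effect alleles carried (0, 1, or 2). The sketch below converts the score to a percentile by assuming the reference population's scores are normally distributed. The category cut-offs (20, 80, 95) and the treatment of missing SNPs as dosage 0 are placeholder assumptions, not DeepDNA's published thresholds.

```python
import math


def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over the SNPs in the weight set."""
    # Missing SNPs count as dosage 0 here (a simplifying assumption).
    return sum(weights[rsid] * dosages.get(rsid, 0) for rsid in weights)


def percentile(score, pop_mean, pop_sd):
    """Percentile (0-100) of a score under a normal reference distribution."""
    z = (score - pop_mean) / pop_sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


def categorise(pct):
    """Map a percentile to a risk band; the cut-offs are illustrative only."""
    if pct < 20:
        return "low"
    if pct < 80:
        return "average"
    if pct < 95:
        return "elevated"
    return "high"


if __name__ == "__main__":
    # "rsA" and "rsB" are placeholder identifiers, not real variants.
    score = polygenic_score({"rsA": 2, "rsB": 1}, {"rsA": 0.2, "rsB": -0.1})
    print(score, categorise(percentile(score, pop_mean=0.1, pop_sd=0.2)))
```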
Pharmacogenomic profiling
Your metaboliser status (for drug-metabolising enzymes such as CYP2D6, CYP2C19, and CYP1A2) or functional genotype (for genes such as VKORC1, SLCO1B1, and MTHFR, which are not drug-metabolising enzymes) is determined for clinically actionable genes. Each result maps to specific medications with FDA/EMA-recognised pharmacogenomic labels and PharmGKB evidence levels (1A, 2A).
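To make the mapping concrete, here is a sketch for a single gene, CYP2C19, restricted to four common star alleles. The allele-function assignments and phenotype terms follow the widely used CPIC conventions. The function name and the `"indeterminate"` fallback are assumptions, and a real caller would first have to infer the diplotype from the genotyped SNPs, which array data cannot always do unambiguously.

```python
# Function class of each star allele (limited to a few common alleles).
ALLELE_FUNCTION = {"*1": "normal", "*2": "none", "*3": "none", "*17": "increased"}

# Phenotype keyed by the alphabetically sorted pair of allele functions.
PHENOTYPE = {
    ("increased", "increased"): "ultrarapid metaboliser",
    ("increased", "normal"): "rapid metaboliser",
    ("normal", "normal"): "normal metaboliser",
    ("none", "normal"): "intermediate metaboliser",
    ("increased", "none"): "intermediate metaboliser",
    ("none", "none"): "poor metaboliser",
}


def cyp2c19_phenotype(diplotype):
    """Translate a diplotype such as '*1/*2' into a metaboliser phenotype."""
    first, second = diplotype.split("/")
    try:
        functions = tuple(sorted((ALLELE_FUNCTION[first], ALLELE_FUNCTION[second])))
    except KeyError:
        return "indeterminate"  # allele outside this small lookup table
    return PHENOTYPE[functions]


if __name__ == "__main__":
    print(cyp2c19_phenotype("*2/*2"))
```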
AI-powered report generation
A frontier language model synthesises all annotation data into a structured, plain-language report. Findings are prioritised by clinical actionability. Every claim is traceable to a source database. The AI is instructed to never overstate, never diagnose, and always recommend professional consultation.
2. Clinical databases
We do not invent our own science. Every annotation in a DeepDNA report comes from established, peer-reviewed, and publicly auditable databases maintained by leading research institutions worldwide.
ClinVar
NIH/NCBI archive of genomic variant–condition relationships. Clinical significance, review status, associated conditions.
PharmGKB
Stanford University pharmacogenomics knowledge base. Drug-gene interactions, dosing guidelines, metaboliser phenotypes.
GWAS Catalog
EMBL-EBI/NHGRI catalogue of genome-wide association studies. Trait associations, effect sizes, p-values from published research.
gnomAD
Broad Institute genome aggregation database. Population allele frequencies across 8+ ancestry groups from 200,000+ individuals.
MyVariant.info
Scripps Research aggregated variant annotation. CADD pathogenicity, SIFT/PolyPhen functional predictions, COSMIC somatic data.
dbSNP
NCBI database of single nucleotide polymorphisms. Reference SNP identifiers and genomic coordinates for all known human variants.
3. AI models & providers
Genomics is not a single-model problem. Different tasks — variant interpretation, protein structure prediction, literature synthesis, risk communication — demand different architectures. DeepDNA integrates multiple frontier AI systems, each deployed where its strengths matter most.
Claude
Anthropic
Primary reasoning engine for report generation, variant interpretation, and interactive genomic chat. Constitutional AI ensures outputs never overstate clinical significance or make diagnostic claims.
Live · Report generation · Chat
AlphaFold / AlphaFold 3
Google DeepMind
Protein structure prediction for missense variants. When your DNA carries a variant that changes an amino acid, AlphaFold predicts the structural impact on the resulting protein: is the fold disrupted? Is the active site affected?
Integrating · Structure prediction · Missense analysis
Gemini
Google DeepMind
Multi-modal reasoning for cross-referencing genomic findings with published medical literature, clinical imaging context, and structured biomarker data.
Integrating · Literature synthesis · Multi-modal
ESM-2 / ESMFold
Meta AI (FAIR)
Protein language model trained on 250 million sequences. Used for variant effect prediction at the protein level, predicting whether a mutation is likely tolerated or deleterious based on evolutionary conservation.
Integrating · Variant effect · Protein language
GPT-4o
OpenAI
Secondary reasoning model for multi-agent review. Used in our quality assurance pipeline where multiple AI perspectives cross-validate findings before they reach your report.
Integrating · Quality assurance · Multi-agent review
Llama 3
Meta AI
Open-weight model deployed on-premises for privacy-sensitive tasks. Processes intermediate annotation data that never leaves our European infrastructure. No external API call required.
Integrating · On-premises · EU-only
Mistral Large
Mistral AI · France
European-built frontier model. Used for GDPR-compliant text generation where data must remain within EU jurisdiction at every layer, including the model provider.
Integrating · EU-native · GDPR pipeline
PubMedBERT / BioGPT
Microsoft Research
Domain-specific biomedical language models fine-tuned on PubMed and PMC corpora. Used for extracting structured findings from genomic literature and mapping variants to published evidence.
Integrating · Literature mining · Evidence grading
We are model-agnostic by design. As new models emerge, whether from Anthropic, DeepMind, or the open-source community, we evaluate them against our genomic benchmarks and integrate the ones that improve outcomes. No single provider lock-in. No black box.
4. Beyond DNA: blood tests & biomarkers
Your genome tells you what could happen. Your blood tells you what is happening. The most powerful health intelligence comes from combining both.
How it works
You upload a blood test report (PDF, photo, or structured data). Our system extracts biomarker values using multi-modal AI, then cross-references each value against your genetic profile:
| Biomarker | Genetic context | Insight |
|---|---|---|
| LDL Cholesterol | PCSK9, LDLR, APOB variants | Distinguish lifestyle-driven vs. genetically-driven hyperlipidaemia. Flag familial hypercholesterolaemia risk. |
| HbA1c | TCF7L2, SLC30A8, T2D PRS | Correlate current glucose control with genetic diabetes susceptibility. Early warning system. |
| Vitamin D (25-OH) | VDR, GC, CYP2R1 variants | Explain why some people remain deficient despite supplementation. Personalise dosing. |
| Ferritin / Iron | HFE C282Y, H63D variants | Detect hereditary haemochromatosis carriers. Contextualise iron overload or deficiency. |
| TSH / T4 | FOXE1, TPO, TSHR variants | Genetic thyroid disease predisposition alongside current thyroid function. |
| Homocysteine | MTHFR C677T, A1298C | MTHFR status explains elevated homocysteine. Guides folate supplementation strategy. |
| CRP (hs-CRP) | IL6, CRP, TNF-α variants | Differentiate genetic inflammatory predisposition from acute/chronic inflammation. |
| Creatine Kinase | SLCO1B1 rs4149056 | If you carry the statin myopathy variant, elevated CK may indicate early muscle damage. |
| Lipid Panel | APOE ε2/ε3/ε4 | APOE genotype modulates lipid response to diet. ε4 carriers may need different strategies. |
| Complete Blood Count | HBB, HBA1/HBA2 variants | Identify thalassaemia or sickle cell trait carriers with unexplained microcytic anaemia. |
Supported input formats
- PDF reports from any laboratory (multi-modal AI extraction)
- Photos of printed results (camera or scan)
- HL7 FHIR structured health data (direct integration with clinical systems)
- Manual entry of individual biomarker values
- Wearable sync — continuous biomarker data from devices (planned)
The correlation engine
This is not just two reports side by side. Our correlation engine performs gene–biomarker interaction analysis:
- Genetic risk contextualisation — A high PRS for Type 2 Diabetes combined with borderline HbA1c means something different than either alone. We calculate joint risk estimates.
- Pharmacogenomic monitoring — If you carry CYP2C19 poor metaboliser status and are on clopidogrel, we flag platelet function markers that your doctor should monitor.
- Nutrigenomic optimisation — MTHFR status + homocysteine levels = a precise, evidence-based folate supplementation recommendation, not a generic one.
- Longitudinal tracking — Upload blood tests over time. See how your biomarkers evolve relative to your genetic baseline. Detect drift before it becomes disease.
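To illustrate the first point, the sketch below combines a Type 2 Diabetes PRS percentile with an HbA1c value using simple rules. It is a rule-based stand-in, not a statistical joint risk estimate. The HbA1c bands follow the commonly used 5.7% and 6.5% cut-offs, while the PRS threshold of 80 and the flag labels are hypothetical.

```python
def hba1c_band(hba1c_percent):
    """Classify HbA1c (%) using the commonly cited 5.7 and 6.5 cut-offs."""
    if hba1c_percent >= 6.5:
        return "diabetes-range"
    if hba1c_percent >= 5.7:
        return "borderline"
    return "normal"


def t2d_context(prs_percentile, hba1c_percent, high_prs_cutoff=80):
    """Combine genetic susceptibility with the current biomarker into one flag."""
    band = hba1c_band(hba1c_percent)
    high_prs = prs_percentile >= high_prs_cutoff  # illustrative threshold
    if band == "diabetes-range":
        return "discuss with clinician"
    if band == "borderline" and high_prs:
        return "priority follow-up"  # both signals point the same way
    if band == "borderline" or high_prs:
        return "monitor"             # only one signal is raised
    return "no flag"


if __name__ == "__main__":
    print(t2d_context(prs_percentile=91, hba1c_percent=5.9))
```

The same borderline HbA1c produces a different flag depending on the PRS, which is the sense in which the combination carries more information than either value alone.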
5. Validation & scientific rigour
We do not guess. Every finding in a DeepDNA report is graded and traceable.
Evidence grading
- Level 1A — Pharmacogenomic associations with FDA/EMA-approved drug labels and CPIC guidelines
- Level 2A — Associations replicated in multiple independent studies with significant effect sizes
- Informational — Associations from single GWAS or preliminary research, clearly labelled as early-stage evidence
What we check
- Every variant annotation is cross-referenced against at least two independent databases
- Polygenic risk scores use published, validated SNP weight sets from peer-reviewed studies
- AI-generated explanations are constrained by system-level instructions that prohibit diagnostic claims, require hedging language for uncertain findings, and mandate professional consultation recommendations
- Pathogenicity predictions (CADD, SIFT, PolyPhen-2) use established computational thresholds, not custom-trained models
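The two-database rule and the evidence levels could be enforced with a small filter such as the one below. This is a sketch only: the finding structure (`variant`, `level`, `sources`) and the function name `reportable` are assumptions, not the actual report schema.

```python
# Lower rank means stronger evidence; labels mirror the grading list above.
EVIDENCE_RANK = {"1A": 0, "2A": 1, "Informational": 2}


def reportable(findings, min_sources=2):
    """Keep findings backed by enough distinct databases, strongest evidence first."""
    kept = [f for f in findings if len(set(f["sources"])) >= min_sources]
    return sorted(kept, key=lambda f: EVIDENCE_RANK[f["level"]])


if __name__ == "__main__":
    findings = [
        {"variant": "rs4149056", "level": "1A", "sources": ["PharmGKB", "ClinVar"]},
        {"variant": "rs0000000", "level": "2A", "sources": ["GWAS Catalog"]},  # placeholder rsID
    ]
    print([f["variant"] for f in reportable(findings)])
```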
What we do not do
- We do not diagnose disease. DeepDNA reports are educational and informational.
- We do not replace genetic counsellors or physicians.
- We do not claim clinical-grade accuracy for consumer genotyping data (which has inherent limitations vs. whole-genome sequencing).
- We do not retain your data after analysis. There is no "DeepDNA database" of genomes.
6. Privacy architecture
Your DNA is processed, not stored. The technical architecture enforces this:
- Processing servers: Hetzner, Helsinki, Finland. EU jurisdiction.
- Encryption: TLS 1.3 in transit, AES-256 at rest during processing.
- Data lifecycle: Upload → parse → annotate → generate report → delete source file.
- No third-party sharing: Annotation queries use rsIDs (variant identifiers), never your full genotype file. No external service sees your complete genetic profile.
- Zero cookies, zero trackers: Our website sets no cookies and uses no analytics trackers.
- AI provider isolation: The AI model receives a summarised report for explanation, not raw variant data. Your genome never reaches the language model in full.
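Two of these guarantees can be sketched in code: deleting the source file even when processing fails, and sending only rsIDs to external annotation services. The helper names below are hypothetical, and the sketch assumes a tab-separated file with `#` comment lines. Note that `os.remove` unlinks the file but does not securely wipe the underlying disk blocks.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def ephemeral_upload(data):
    """Write uploaded bytes to a temp file and always delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so the source file never lingers.
        if os.path.exists(path):
            os.remove(path)


def rsids_only(path):
    """Extract variant identifiers without the genotype calls."""
    ids = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            ids.append(line.split("\t", 1)[0])
    return ids


if __name__ == "__main__":
    with ephemeral_upload(b"# demo\nrs1\t1\t100\tAG\n") as upload_path:
        print(rsids_only(upload_path))
```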
Full details: Privacy Policy.
7. API access for developers & researchers
DeepDNA exposes a RESTful API for programmatic access to the analysis pipeline. Upload a genotype file, receive structured JSON with annotations, risk scores, and pharmacogenomic profiles. Designed for integration into clinical workflows, research pipelines, and third-party health platforms.
- POST /api/upload: Submit genotype file
- POST /api/analyze/{id}: Run full analysis pipeline
- GET /api/report/{id}: Retrieve structured report (JSON)
- POST /api/chat: Interactive genomic Q&A (SSE stream)
- POST /api/biomarkers: Submit blood test data for correlation
- GET /api/correlation/{id}: Gene–biomarker correlation report
All API responses include full provenance: source database, evidence level, population frequency, and literature references for every annotation. Machine-readable by design.
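A hedged sketch of how a client might prepare requests for these endpoints with the Python standard library. The base URL is a placeholder, the JSON payload shape for `/api/biomarkers` is an assumption, authentication is not covered because it is not described here, and the sketch assumes the same `{id}` is used across the analyze, report, and correlation endpoints. The requests are only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid"  # placeholder, not the real host


def build_request(method, path, payload=None):
    """Build (but do not send) a JSON request for one of the documented endpoints."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def analysis_requests(analysis_id):
    """Requests for the analyze, report, and correlation steps of one analysis."""
    return [
        build_request("POST", f"/api/analyze/{analysis_id}"),
        build_request("GET", f"/api/report/{analysis_id}"),
        build_request("GET", f"/api/correlation/{analysis_id}"),
    ]


if __name__ == "__main__":
    for request in analysis_requests("abc123"):
        print(request.get_method(), request.full_url)
```

Passing a prepared request to `urllib.request.urlopen` would send it. Separating construction from sending keeps the example testable without network access.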
8. What makes this different
Most consumer genomics platforms give you a PDF and call it a day. DeepDNA is a living intelligence layer on top of your genome:
- Multi-model AI — Not one model, but an ensemble of frontier systems from Anthropic, Google DeepMind, Meta AI, Mistral, and the open-source community. Each model handles what it does best.
- Real-time databases — Your report draws from live, continuously updated sources. When ClinVar reclassifies a variant, your analysis reflects it.
- DNA + blood integration — The first platform to systematically correlate genetic predisposition with actual biomarker data. Genotype meets phenotype.
- Interactive chat — Ask questions about your results in natural language. The AI has full context of your report and can explain any finding in depth.
- No data retention — We process and delete. Your genome is not our business model.
- European sovereignty — Built in Europe, hosted in Europe, governed by European law. Your genetic data never leaves GDPR jurisdiction.
- Open API — Researchers and developers can integrate the full pipeline into their own systems.