We’ve developed a SAT-solver-based algorithm to generate synthetic data that is hallucination-resistant and provably private.
Synthetic data is used in clinical applications to increase representation of rare events for machine learning, make data shareable without privacy implications, or create unique test cases for quality control.
However, current computational approaches for creating synthetic representations from real data have limitations that hamper their applicability. Specifically, statistical methods like Markov chains rely on haplotype shuffling, which risks inadvertently exposing genomic segments that hold sensitive or personally identifiable information about participants.
Similarly, artificial intelligence (AI) approaches such as restricted Boltzmann machines (RBMs) or generative adversarial networks (GANs) generate novel sequences that can contain variant combinations that may not be viable in humans, i.e. “hallucinations”.
We developed Genomator (patent pending), a SAT-solving-based approach whose logic-solving engine prevents the generation of “fake” data (hallucination-free) and is capable of generating provably private data.
Value Proposition
- Hallucination-resistant: a solution based on logic-solving cannot create synthetic data that is incompatible with reality, unlike AI-based approaches.
- Provably private: our approach is reversible, enabling us to identify exactly whose data was used and thereby quantify data leakage absolutely rather than by proxy.
- Tailorable accuracy-privacy tradeoff: Genomator has a "privacy" knob that can generate data for different use cases, from clinical research, which requires high accuracy, to increasing underrepresented samples, which requires absolute privacy.
- Highly scalable: our approach is resource-efficient and scales to whole-genome sequencing data of large cohorts.
Our Approach
SAT solvers are a class of algorithms that solve problems by (SAT)isfying constraints between Boolean variables. This enables the efficient deductive construction of synthetic genomes from input data while ensuring that no unrealistic variant combinations are created.
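To make the idea concrete, here is a minimal Python sketch of how generating a synthetic genome could be posed as a Boolean satisfiability problem. It is an illustration only, not Genomator's actual encoding: the toy cohort, the two constraint rules (forbid pairwise variant combinations never seen in the input, and differ from every real genome in at least one site) and all function names are assumptions made for this example, and the small backtracking search stands in for a production SAT solver.

```python
from itertools import combinations, product

# Toy cohort (made-up values): one tuple per person, 1 = variant present at that site.
COHORT = [(0, 0, 1, 1), (0, 1, 0, 1), (1, 0, 0, 1), (1, 1, 1, 0)]
N_SITES = 4


def literal(site, value):
    """DIMACS-style literal meaning 'the synthetic genome has `value` at `site`'."""
    var = site + 1  # variables are numbered from 1
    return var if value else -var


def build_clauses(cohort, n_sites):
    """Encode the two assumed rules as CNF clauses (lists of literals)."""
    clauses = []
    # Assumed rule 1: forbid any pairwise combination never observed in the cohort.
    for i, j in combinations(range(n_sites), 2):
        seen = {(g[i], g[j]) for g in cohort}
        for a, b in product((0, 1), repeat=2):
            if (a, b) not in seen:
                # "not (x_i == a and x_j == b)" as a clause: x_i != a OR x_j != b
                clauses.append([literal(i, not a), literal(j, not b)])
    # Assumed rule 2: differ from every real genome in at least one site.
    for g in cohort:
        clauses.append([literal(s, not g[s]) for s in range(n_sites)])
    return clauses


def solve(clauses, n_vars, assignment=None):
    """Tiny backtracking search; returns {var: bool} or None if unsatisfiable."""
    assignment = assignment or {}
    for clause in clauses:
        # A clause is falsified when every literal is assigned and evaluates to False.
        if all(abs(l) in assignment and assignment[abs(l)] != (l > 0) for l in clause):
            return None
    if len(assignment) == n_vars:
        return assignment
    var = next(v for v in range(1, n_vars + 1) if v not in assignment)
    for value in (True, False):
        result = solve(clauses, n_vars, {**assignment, var: value})
        if result is not None:
            return result
    return None


model = solve(build_clauses(COHORT, N_SITES), N_SITES)
synthetic = tuple(int(model[v]) for v in range(1, N_SITES + 1)) if model else None
print(synthetic)
```

Because every constraint is a hard logical clause, any genome the search returns satisfies all of them by construction; if no such genome exists, the search reports that instead of producing an invalid one.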
Application cases
- Increasing underrepresented samples: Currently, 95% of genomic data comes from individuals of European ancestry. This underrepresentation can lead to biased and incomplete medical understanding. Genomator can generate synthetic genomes that accurately represent different ethnicities and populations, allowing researchers to augment existing datasets.
- Sharing data: Direct-to-consumer products offer a variety of insights generated from genomes, such as ancestry and drug metabolism. However, sharing exact genomes with third-party providers can be a security risk. Genomator can create anonymized digital clones of an individual’s genome that can be safely exposed to untrusted third-party medical or direct-to-consumer services by systematically obfuscating identifying information.
- Creating realistic test cases: Quality control checks are increasingly employed to standardize genomic services. Genomator can rapidly generate plausible genomic data to aid in testing genomic software and process pipelines, including at the scale of the whole genome of 3 billion letters.
- Domain-agnostic: Generating synthetic data using our approach is not limited to the genomic space. Genomator is application-agnostic and can potentially create anonymized representations of high-dimensional digital datasets (e.g. market analysis, consumer information).
Do business with us
Let us be your innovation catalyst by helping you understand the health space, solve your pain points and innovate to keep you ahead.