PhenoEncoder: A Discriminative Embedding Approach to Genomic Data Compression

Bibliographic Details
Published in: bioRxiv
Main Authors: Tas, Gizem; Postma, Eric; Balvert, Marleen; Schoenhuth, Alexander
Format: Paper
Language: English
Published: Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 10.12.2024
Cold Spring Harbor Laboratory
Edition: 1.1
ISSN: 2692-8205
Description
Summary: Exploring the heritability of complex genetic traits requires methods that can handle the genome's vast scale and the intricate relationships among genetic markers. Widely accepted association studies overlook non-linear effects (epistasis), prompting the adoption of deep neural networks (DNNs) for their scalability with large genetic datasets. However, the curse of dimensionality continues to limit the potential of DNNs, underscoring the need for dimensionality reduction that suitably sizes and shapes the genetic inputs while preserving epistasis. Linkage disequilibrium (LD), a measure of correlation between genetic loci, offers a pathway to genome compression with minimal information loss by defining SNP windows. These windows constitute genomic regions, i.e., haplotype blocks, which can be locally compressed using deep autoencoders. While autoencoders excel at preserving meaningful non-linear patterns, they still risk losing phenotype-relevant information when the compression is dominated by other sources of genetic variation. We propose a novel approach, PhenoEncoder, that incorporates phenotypic variance directly into compression. This SNP-based pipeline employs multiple autoencoders, each dedicated to compressing a single haplotype block. The window-based sparsity of the model eases the computational burden of simultaneously processing numerous SNPs. Concurrently, an auxiliary classifier predicts the phenotype from the compressed haplotype blocks. Epistasis is processed both within and between haplotype blocks by maintaining non-linearity in the autoencoders and the classifier. Through joint optimization of the compression and classification losses, PhenoEncoder ensures that disease-causing patterns are highlighted during compression.
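The architecture described above can be illustrated with a minimal NumPy sketch of a single forward pass. It is not the authors' implementation (their code is linked in the footnotes). The block sizes, latent dimension, single-layer encoders and decoders, tanh activations, mean-squared reconstruction error, and the weighting parameter `alpha` are all assumptions made for illustration; the summary states only that the compression and classification losses are optimized jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_dense(n_in, n_out):
    """Return (weights, bias) for one dense layer with small random weights."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def make_block_autoencoder(n_snps, n_latent):
    """Hypothetical per-block autoencoder: one encoder and one decoder layer."""
    return {"enc": init_dense(n_snps, n_latent), "dec": init_dense(n_latent, n_snps)}

def joint_loss(x, y, block_slices, autoencoders, clf_hidden, clf_out, alpha=0.5):
    """Forward pass returning (joint, reconstruction, classification) losses.

    alpha is an assumed weight between the two loss terms.
    """
    latents, recon_terms = [], []
    for sl, ae in zip(block_slices, autoencoders):
        xb = x[:, sl]                                   # SNPs of one haplotype block
        z = np.tanh(xb @ ae["enc"][0] + ae["enc"][1])   # non-linear compression within the block
        xb_hat = z @ ae["dec"][0] + ae["dec"][1]        # reconstruction of the block
        latents.append(z)
        recon_terms.append(np.mean((xb - xb_hat) ** 2))
    z_all = np.concatenate(latents, axis=1)             # compressed blocks, side by side
    h = np.tanh(z_all @ clf_hidden[0] + clf_hidden[1])  # non-linear mixing between blocks
    logits = (h @ clf_out[0] + clf_out[1]).ravel()
    p = 1.0 / (1.0 + np.exp(-logits))                   # predicted phenotype probability
    eps = 1e-12
    bce = float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
    recon = float(np.mean(recon_terms))
    return alpha * recon + (1 - alpha) * bce, recon, bce

# Toy data: 8 samples, 15 SNPs coded 0/1/2, split into three assumed blocks.
block_sizes = [6, 4, 5]
latent_dim = 2
starts = np.cumsum([0] + block_sizes[:-1])
block_slices = [slice(int(s), int(s) + n) for s, n in zip(starts, block_sizes)]
x = rng.integers(0, 3, size=(8, sum(block_sizes))).astype(float)
y = rng.integers(0, 2, size=8).astype(float)            # binary phenotype labels

autoencoders = [make_block_autoencoder(n, latent_dim) for n in block_sizes]
clf_hidden = init_dense(latent_dim * len(block_sizes), 4)
clf_out = init_dense(4, 1)

total, recon, bce = joint_loss(x, y, block_slices, autoencoders, clf_hidden, clf_out)
print(f"joint={total:.4f} reconstruction={recon:.4f} classification={bce:.4f}")
```

In a trainable version, the gradient of the joint loss would flow through both the classifier and every block encoder, which is how the phenotype signal could shape the compressed representation.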
Applied to a mouse protein expression dataset and a simulated complex phenotype dataset from VariantSpark, PhenoEncoder demonstrated enhanced generalizability in downstream classification tasks compared to standard autoencoder compression. Notably, the PhenoEncoder model itself achieved classification performance on par with that of these downstream tasks. By enabling phenotype-aware compression, PhenoEncoder emerges as a promising approach for discriminative genomic feature extraction.
Competing Interest Statement: The authors have declared no competing interest.
Footnotes:
* https://archive.ics.uci.edu/dataset/342/mice+protein+expression
* http://gigadb.org/dataset/100759
* https://github.com/gizem-tas/phenoencoder
DOI: 10.1101/2024.12.06.625879