Beyond the Alphabet: Deep Signal Embedding for Enhanced DNA Clustering

The exponential growth of digital data has fueled interest in DNA as a storage medium due to its unmatched density and durability. However, clustering the billions of reads, a step required for error correction and data reconstruction, remains a major bottleneck, because traditional edit-distance-based methods are both computationally expensive and prone to data loss. This paper introduces a novel \emph{signal-model} that processes raw Nanopore signals directly, bypassing the error-prone basecalling step. By leveraging the analog signal information, the \emph{signal-model} reduces computation time by up to three orders of magnitude compared to edit-distance approaches while delivering superior accuracy. It also outperforms DNA sequence embedding methods in both accuracy and efficiency. Furthermore, our experiments show that the \emph{signal-model} achieves higher clustering accuracy than existing strand-based algorithms, saving days of computation without compromising quality. Overall, this work represents a significant advance in DNA data storage, showing how signal-based analysis can substantially improve both accuracy and scalability.

Currently under review.