Rethinking Learning-based Method for Lossless Genome Compression

Han Yang (Alibaba Group); Fei Gu (Alibaba Group); Jieping Ye (Alibaba Group)

07 Jun 2023

Lossless genome compression plays a vital role in genomic analysis. The main challenges stem from the length of genome sequences and the high frequency of genome variants. Almost all existing learning-based methods rely on local DNA fragments in small windows, so they cannot capture deeper regularities of the genome sequence and may perform poorly because of the large variability between individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce the global position within the complete genome sequence into the model, which makes the sequence distinguishable at the base level. Second, we pre-train the model to identify single-nucleotide polymorphism (SNP) variants, which further benefits the genome compression task. We validate CompressBERT on three datasets from different species. Experimental results show that our approach outperforms state-of-the-art methods.
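The first design can be illustrated with a small sketch. It is not the CompressBERT implementation: the abstract does not say how global position is encoded, so the sinusoidal encoding, the one-hot base features, and the names `one_hot`, `sinusoidal`, and `encode_window` are all assumptions made for illustration. The sketch only shows why window-local positions leave two identical fragments indistinguishable, while positions taken from the complete sequence separate them.

```python
import math

BASES = "ACGT"

def one_hot(base):
    """One-hot encode a single nucleotide (A, C, G, T)."""
    return [1.0 if base == b else 0.0 for b in BASES]

def sinusoidal(position, dim=8):
    """Sinusoidal encoding of a position (an assumed, illustrative choice)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc

def encode_window(genome, start, length, use_global=True):
    """Encode genome[start:start+length] as one feature vector per base.

    With use_global=True the position is the index in the full genome;
    otherwise it is the offset inside the window.
    """
    features = []
    for offset in range(length):
        pos = start + offset if use_global else offset
        features.append(one_hot(genome[start + offset]) + sinusoidal(pos))
    return features

# Toy sequence in which the fragment "ACGTACGT" occurs at index 0 and index 11.
genome = "ACGTACGTTTGACGTACGTA"

local_a = encode_window(genome, 0, 8, use_global=False)
local_b = encode_window(genome, 11, 8, use_global=False)
global_a = encode_window(genome, 0, 8, use_global=True)
global_b = encode_window(genome, 11, 8, use_global=True)

print(local_a == local_b)    # identical fragments look the same to the model
print(global_a == global_b)  # global positions tell them apart
```

With window-local positions the two occurrences yield identical inputs, so a model can only ever make the same prediction for both. With global positions each base carries its location in the complete sequence, which is the property the abstract describes as making the sequence distinguishable at the base level.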
