DNA digital data storage

DNA digital data storage is the process of encoding and decoding binary data to and from synthesized strands of DNA.

DNA has enormous potential as a storage medium because of its high storage density, but its practical use is currently severely limited by its high cost and very slow read and write times.

In June 2019, scientists reported that all 16 GB of text from the English Wikipedia had been encoded into synthetic DNA.

Encoding methods

Many methods for encoding data in DNA are possible. The optimal methods are those that make economical use of DNA and protect against errors.

Encoding text

Several simple methods for encoding text have been proposed. Most of these involve translating each letter into a corresponding "codon", consisting of a unique small sequence of nucleotides in a lookup table. Some examples of these encoding schemes include Huffman codes, comma codes, and alternating codes.
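The lookup-table idea can be illustrated with a minimal sketch. The table below is hypothetical: it simply assigns the first 27 of the 64 possible three-nucleotide codons to the letters A to Z and the space character. It is a plain fixed-length code, not an implementation of the Huffman, comma, or alternating codes, and the names ALPHABET, CODONS, encode_text and decode_text are illustrative only.

```python
from itertools import product
import string

# Hypothetical alphabet: 26 uppercase letters plus the space character.
ALPHABET = string.ascii_uppercase + " "

# All 64 three-nucleotide codons in lexicographic order: AAA, AAC, AAG, ...
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

# Assumed lookup table: pair each character with one unique codon.
ENCODE = dict(zip(ALPHABET, CODONS))
DECODE = {codon: char for char, codon in ENCODE.items()}


def encode_text(text):
    """Translate each character into its codon and concatenate the result."""
    return "".join(ENCODE[ch] for ch in text.upper())


def decode_text(dna):
    """Split the sequence into codons of length 3 and look each one up."""
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

With this table, encode_text("AB") gives "AAAAAC", which also shows a weakness of such a naive code: it does nothing to avoid runs of identical nucleotides.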

Encoding arbitrary data

To encode arbitrary data in DNA, the data is typically first converted into ternary (base 3) data rather than binary (base 2) data. Each digit (or "trit") is then converted to a nucleotide using a lookup table. To prevent homopolymers (runs of a repeated nucleotide), which can cause problems with accurate sequencing, the result of the lookup also depends on the preceding nucleotide. For example, in one such lookup table, if the previous nucleotide in the sequence is T (thymine) and the trit is 2, the next nucleotide will be G (guanine).
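The following sketch shows one way such a context-dependent lookup could work. The table NEXT_NT is an assumption: only the entry for previous nucleotide T and trit 2 (giving G) comes from the text, and the remaining entries were chosen so that each row omits the previous nucleotide. The fixed six-trits-per-byte conversion and the starting nucleotide "A" are also simplifying assumptions, not part of any published scheme.

```python
# Assumed lookup table: NEXT_NT[previous][trit] -> next nucleotide.
# Each row leaves out the previous nucleotide, so no base is ever repeated.
NEXT_NT = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def bytes_to_trits(data):
    """Convert bytes to base-3 digits, 6 trits per byte (3**6 = 729 >= 256)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))  # most significant trit first
    return trits


def trits_to_dna(trits, start="A"):
    """Map each trit to a nucleotide that depends on the preceding one."""
    prev, out = start, []
    for t in trits:
        prev = NEXT_NT[prev][t]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, start="A"):
    """Invert trits_to_dna by finding each nucleotide in the row of its predecessor."""
    prev, trits = start, []
    for nt in dna:
        trits.append(NEXT_NT[prev].index(nt))
        prev = nt
    return trits


def trits_to_bytes(trits):
    """Rebuild bytes from groups of 6 trits."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for t in trits[i:i + 6]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)
```

Because every row of the table excludes the preceding nucleotide, the output never contains two identical adjacent bases, and decoding only needs the same table and the same starting nucleotide.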

Various systems may be incorporated to partition and address the data, as well as to protect it from errors. One approach to error correction is to regularly intersperse synchronization nucleotides between the information-encoding nucleotides. These synchronization nucleotides can act as scaffolds when reconstructing the sequence from multiple overlapping strands.
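As a rough illustration of interspersed synchronization nucleotides, the sketch below inserts one known marker nucleotide after every fixed-size group of payload nucleotides. The period of 4, the cycling marker pattern "ACGT", and the function names are all assumptions made for this example; real schemes differ in detail, and this toy version does not avoid homopolymers.

```python
def add_sync(payload, period=4, marker="ACGT"):
    """Insert one marker nucleotide after every full group of `period` payload bases."""
    out = []
    for i in range(0, len(payload), period):
        chunk = payload[i:i + period]
        out.append(chunk)
        if len(chunk) == period:  # no marker after a trailing partial group
            out.append(marker[(i // period) % len(marker)])
    return "".join(out)


def strip_sync(framed, period=4):
    """Remove the markers, keeping only the payload bases."""
    step = period + 1
    return "".join(framed[i:i + period] for i in range(0, len(framed), step))


def sync_ok(framed, period=4, marker="ACGT"):
    """Check that every expected marker position holds the expected nucleotide."""
    step = period + 1
    return all(
        framed[i] == marker[(i // step) % len(marker)]
        for i in range(period, len(framed), step)
    )
```

A single deleted base shifts every later marker out of position, so sync_ok tends to fail on a damaged read, which hints at how known markers can help align strands during reconstruction.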

In vivo

The genetic code within living organisms can potentially be co-opted to store information. Furthermore, synthetic biology can be used to engineer cells with "molecular recorders" that allow information to be stored in, and retrieved from, the cell's genetic material.

In-vivo light-based direct image and data recording

A proof-of-concept in-vivo direct DNA data recording system was demonstrated by incorporating optogenetically regulated recombinases into an engineered "molecular recorder", which allows light-based stimuli to be encoded directly into engineered E. coli cells.

This approach relies on the light-regulated recombinases editing a "recorder plasmid", which makes it possible to identify cell populations exposed to different stimuli. The physical stimulus is thus encoded directly into the recorder plasmid through recombinase action. Unlike other approaches, it does not require manual design, insertion and cloning of artificial sequences to record the data into the genetic code. In this recording process, the cell population in each well of a cell-culture plate can be treated as a digital "bit", functioning as a biological transistor capable of recording a single bit of data.

History

The idea of DNA digital data storage dates back to 1959, when the physicist Richard P. Feynman, in "There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics", outlined the general prospects for creating artificial objects similar to objects of the microcosm (including biological ones) and having similar or even more extensive capabilities.

One of the earliest uses of DNA storage occurred in a 1988 collaboration between artist Joe Davis and researchers from Harvard University. An image, stored in a DNA sequence in E. coli, was organized in a 5 × 7 matrix that, once decoded, formed a picture of an ancient Germanic rune representing life and the female Earth. In the matrix, ones corresponded to dark pixels while zeros corresponded to light pixels.

In 2007, a device was created at the University of Arizona that used addressing molecules to encode mismatch sites within a DNA strand. These mismatches could then be read out by performing a restriction digest, thereby recovering the data.

In 2011, George Church, Sri Kosuri, and Yuan Gao carried out an experiment to encode a 659 kb book co-authored by Church. The team used a two-to-one correspondence in which a binary zero was represented by either an adenine or a cytosine and a binary one by either a guanine or a thymine. On examination, 22 errors were found in the DNA.
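The two-to-one correspondence can be sketched as follows. The mapping of zero to A or C and of one to G or T comes from the description above. The rule used to pick between the two options (take the first one unless it would repeat the previous nucleotide) is an assumption made for this example and is not claimed to be the rule the researchers used.

```python
ZERO = "AC"  # a binary zero may be written as adenine or cytosine
ONE = "GT"   # a binary one may be written as guanine or thymine


def bits_to_dna(bits):
    """Encode a string of '0'/'1' characters, one nucleotide per bit."""
    out = []
    for b in bits:
        options = ONE if b == "1" else ZERO
        # Assumed tie-break: avoid repeating the previous nucleotide.
        if out and out[-1] == options[0]:
            out.append(options[1])
        else:
            out.append(options[0])
    return "".join(out)


def dna_to_bits(dna):
    """Decode by class membership: G or T is a one, A or C is a zero."""
    return "".join("1" if nt in ONE else "0" for nt in dna)
```

Having two nucleotides per bit value costs density (one bit per base instead of up to two) but gives the encoder freedom to avoid sequences that are hard to synthesize or read.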

In 2012, George Church and colleagues at Harvard University published an article in which DNA was encoded with digital information that included an HTML draft of a 53,400-word book written by the lead researcher, eleven JPEG images and one JavaScript program. Multiple copies were added for redundancy, and 5.5 petabits can be stored in each cubic millimeter of DNA.

In 2013, an article led by researchers from the European Bioinformatics Institute (EBI), submitted at around the same time as the paper by Church and colleagues, detailed the storage, retrieval, and reproduction of over five million bits of data. All the DNA files reproduced the information with an accuracy between 99.99% and 100%.

In 2013, software called DNACloud was developed by Manish K. Gupta and co-workers to encode computer files to their DNA representation. It implements a memory-efficient version of the algorithm proposed by Goldman et al. to encode (and decode) data to DNA (.dnac files).

The long-term stability of data encoded in DNA was reported in February 2015, in an article by researchers from ETH Zurich. The team added redundancy via Reed–Solomon error correction coding and protected the DNA by encapsulating it within silica glass spheres via sol-gel chemistry.

In 2016, research by Church and Technicolor Research and Innovation was published in which 22 MB of an MPEG-compressed movie sequence was stored in and recovered from DNA. The sequence was recovered with zero errors.

In March 2017, Yaniv Erlich and Dina Zielinski of Columbia University and the New York Genome Center published a method known as DNA Fountain that stored data at a density of 215 petabytes per gram of DNA. The technique approaches the Shannon capacity of DNA storage, achieving 85% of the theoretical limit. The method was not ready for large-scale use, as it cost $7,000 to synthesize 2 megabytes of data and another $2,000 to read it.

In March 2018, the University of Washington and Microsoft published results demonstrating storage and retrieval of approximately 200 MB of data. The research also proposed and evaluated a method for random access of data items stored in DNA.

Research published by Eurecom and Imperial College in January 2019 demonstrated the ability to store structured data in synthetic DNA. The research showed how to encode structured or, more specifically, relational data in synthetic DNA and also demonstrated how to perform data processing operations (similar to SQL) directly on the DNA as chemical processes.

In April 2019, through a collaboration with TurboBeads Labs in Switzerland, Mezzanine by Massive Attack was encoded into synthetic DNA, making it the first album to be stored in this way.

In June 2019, scientists reported that all 16 GB of Wikipedia had been encoded into synthetic DNA.

The first article describing data storage on native DNA sequences via enzymatic nicking was published in April 2020. In the paper, scientists demonstrated a new method of recording information in the DNA backbone that enables bit-wise random access and in-memory computing.

Davos Bitcoin Challenge

On January 21, 2015, Nick Goldman from the European Bioinformatics Institute (EBI), one of the original authors of the 2013 Nature paper, announced the Davos Bitcoin Challenge.

Almost three years later, on January 19, 2018, the EBI announced that Sander Wuyts, a Belgian PhD student at the University of Antwerp and Vrije Universiteit Brussel, was the first to complete the challenge.

The Lunar Library

The Lunar Library, launched on the Beresheet lander by the Arch Mission Foundation, carries information encoded in DNA, including 20 famous books and 10,000 images. DNA was considered one of the best storage choices because it can last a long time: the Arch Mission Foundation suggests that it can still be read after billions of years.

DNA of things

The concept of the DNA of Things (DoT) was introduced in 2019 by a team of researchers from Israel and Switzerland, including Yaniv Erlich and Robert Grass.