Data Deduplication
ExaGrid looked at the first-generation, traditional approaches to data deduplication and saw that all vendors had used block-level deduplication. This traditional method splits data into 4KB to 10KB groups of bytes called “blocks.” Backup software, due to CPU limitations, uses larger 64KB to 128KB fixed-length blocks. The challenge is that for every 10TB of backup data, the tracking table, or “hash table,” holds about one billion blocks (10TB divided into 10KB blocks). The hash table grows so large that it needs to be housed in a single front-end controller with additional disk shelves, an approach referred to as “scale-up.” As a result, only capacity is added as data grows, and because no additional bandwidth or processing resources are added, the backup window lengthens as data volumes increase. At some point, the backup window becomes too long and the front-end controller must be replaced, a disruptive and expensive step known as a “forklift upgrade.”
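As a rough illustration of the block-level method described above, the following Python sketch splits data into fixed-length blocks, keeps one copy of each unique block in a hash table, and estimates how many blocks must be tracked for a given data size. The function names, the choice of SHA-256, and the decimal units (1KB = 1,000 bytes) are assumptions made for illustration; this is not any vendor's implementation.

```python
import hashlib


def dedup_fixed_blocks(data: bytes, block_size: int = 8 * 1024):
    """Store each unique fixed-length block once, keyed by its hash."""
    table = {}   # hash -> block bytes (the "hash table")
    recipe = []  # ordered hashes needed to rebuild the original data
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).digest()
        table.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)
    return table, recipe


def restore(table, recipe) -> bytes:
    """Rebuild the original data from the hash table and the recipe."""
    return b"".join(table[digest] for digest in recipe)


def blocks_to_track(data_bytes: int, block_size: int) -> int:
    """Number of blocks the data splits into (ceiling division)."""
    return -(-data_bytes // block_size)
```

With 10TB of data and 10KB blocks, `blocks_to_track(10 * 10**12, 10 * 10**3)` returns 1,000,000,000, which matches the one-billion figure above; with 4KB blocks the same data yields 2.5 billion blocks.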
ExaGrid also saw approaches that used byte-level deduplication. Although this method allows for system scalability, known as “scale-out,” it requires an understanding of the format of every backup application, which limits the list of supported backup applications.
An alternative approach is to use hyper-converged scale-out nodes with block-level deduplication. However, this approach is still burdened with the large hash table look-ups and, therefore, requires expensive flash storage to increase performance, which increases the price of the hardware.
ExaGrid has taken a more innovative path. ExaGrid uses zone-level deduplication, which breaks data into larger “zones” and then compares them at the byte level. This approach allows for the best of all worlds. First, the tracking table is one-thousandth the size of the block-level approach, which allows for full appliances in a hyper-converged scale-out solution. As data grows, all resources are added: processor, memory, and bandwidth as well as disk. If data doubles, triples, or quadruples, ExaGrid doubles, triples, or quadruples the processor, memory, bandwidth, and disk accordingly, so the backup window stays at a fixed length as data grows. Second, the zone approach is backup application agnostic, allowing ExaGrid to support virtually any backup application. Lastly, ExaGrid’s approach does not maintain a very large, ever-growing hash table and, therefore, avoids the need for expensive flash to accelerate hash table look-ups. This keeps the cost of the hardware low.
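The idea of comparing large zones at the byte level can be sketched as follows. The text does not describe how zones are formed or matched, so this Python example assumes equal-length zones that are already paired with an earlier version. That pairing, the function names, and the 10MB zone size in the usage note are all hypothetical; this is a simplification and not ExaGrid's actual algorithm.

```python
def byte_delta(old_zone: bytes, new_zone: bytes):
    """Return (offset, replacement_bytes) runs where new_zone differs from old_zone."""
    if len(old_zone) != len(new_zone):
        raise ValueError("this sketch assumes equal-length zones")
    runs = []
    run_start = None
    for i, (a, b) in enumerate(zip(old_zone, new_zone)):
        if a != b and run_start is None:
            run_start = i  # a differing run begins here
        elif a == b and run_start is not None:
            runs.append((run_start, new_zone[run_start:i]))
            run_start = None
    if run_start is not None:  # a run that reaches the end of the zone
        runs.append((run_start, new_zone[run_start:]))
    return runs


def apply_delta(old_zone: bytes, runs) -> bytes:
    """Rebuild the new zone from the old zone plus the stored differences."""
    buf = bytearray(old_zone)
    for offset, chunk in runs:
        buf[offset:offset + len(chunk)] = chunk
    return bytes(buf)


def zones_to_track(data_bytes: int, zone_size: int) -> int:
    """Number of zones the data splits into (ceiling division)."""
    return -(-data_bytes // zone_size)
```

Under the assumed 10MB zone size, `zones_to_track(10 * 10**12, 10 * 10**6)` returns 1,000,000 entries for 10TB, one-thousandth of the one billion entries needed with 10KB blocks.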
In summary, block-level deduplication either drives a scale-up architecture that only adds disk as data grows or, with a scale-out node approach, requires expensive flash storage to perform large hash table look-ups. Both approaches slow down backups and/or increase cost. ExaGrid’s zone-level deduplication includes full server appliances in a scale-out hyper-converged solution without large hash table look-ups, which results in the fastest backup and restore performance at the lowest price. ExaGrid’s approach also supports a wide range of backup applications. This zone-level approach provides the best of all worlds: ExaGrid can work with any backup application and can easily scale, resulting in a fixed-length backup window regardless of data growth.
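The difference in backup window behavior between scale-up and scale-out can be shown with a deliberately simple model. The Python sketch below assumes the window is data volume divided by total ingest throughput, and that a scale-out system adds one node per fixed amount of data. The function names and all throughput and capacity figures are hypothetical, and real systems are affected by network, source, and workload factors that this model ignores.

```python
import math


def scale_up_window_hours(data_tb: float, controller_tb_per_hour: float) -> float:
    """Scale-up: one controller, so throughput stays fixed as data grows."""
    return data_tb / controller_tb_per_hour


def scale_out_window_hours(data_tb: float, tb_per_node: float,
                           node_tb_per_hour: float) -> float:
    """Scale-out: each added node brings its own processing and bandwidth."""
    nodes = max(1, math.ceil(data_tb / tb_per_node))
    return data_tb / (nodes * node_tb_per_hour)
```

With a hypothetical 2TB/hour per controller or node and one node per 10TB, growing from 10TB to 40TB stretches the scale-up window from 5 to 20 hours, while the scale-out window stays at 5 hours because four nodes share the load.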