Bill Andrews, ExaGrid’s President and CEO
Data&StorageAsean: What is the difference between deduplication and other Data Reduction technologies such as compression?
Bill Andrews: Compression eliminates white space and typically reduces data by 1.6:1 to as much as 2:1. It is good for primary storage data. Deduplication compares one data set to another and stores only the changes at the byte or block level, so it is especially good for backup, as the data set from week to week is very similar. For example, if you back up 100TB and keep 18 retention copies, the space required is 1.8PB; if you compress that, you might reduce it to about 1PB.
Alternatively, deduplication compares one backup to the next and keeps only the changes, which amount to about 2% per week. So after 18 weeks, the required space for deduplicated data would be only about 90TB, roughly 1/20th the space of the 18 copies without deduplication.
In summary, 18 copies of 100TB require 1.8PB of storage; with compression, the storage required drops to about 1PB, and with deduplication it drops further to 90TB. The deduplication ratio in this example is 20:1, meaning that the 1.8PB was reduced by 20X to 90TB. Deduplication ratios are affected by data type, length of retention, the number of copies kept, and each vendor's deduplication algorithms. In the above example, different deduplication algorithms would get 2:1, 3:1, 5:1, or 8:1 all the way up to 20:1. Deduplication built into backup software gets less than 10:1 because it doesn't have enough compute to support stronger ratios, whereas dedicated target appliances with their own CPU and memory are able to run more aggressive algorithms and get higher ratios such as 20:1. The better the ratio, the less storage is required.
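To make the arithmetic concrete, here is a minimal Python sketch of the example above. The figures (100TB full backup, 18 retention copies, roughly 1.8:1 compression, about 2% weekly change) come from the example itself; the exact formula used to arrive at roughly 90TB is an illustrative assumption, not any vendor's calculation.

```python
# Illustrative arithmetic only; figures are taken from the example above.
full_backup_tb = 100          # size of one full backup
copies = 18                   # weekly retention copies kept
compression_ratio = 1.8       # typical compression, ~1.6:1 to 2:1
weekly_change = 0.02          # ~2% of data changes between weekly backups

raw_tb = full_backup_tb * copies                      # 1,800 TB = 1.8 PB
compressed_tb = raw_tb / compression_ratio            # ~1,000 TB = ~1 PB

# Deduplication stores one compressed baseline plus only the changed blocks
# of each later copy, so retention adds ~2 TB per week instead of 100 TB.
dedup_tb = (full_backup_tb / compression_ratio
            + (copies - 1) * full_backup_tb * weekly_change)   # ~90 TB

print(f"Raw:          {raw_tb:,.0f} TB")
print(f"Compressed:   {compressed_tb:,.0f} TB")
print(f"Deduplicated: {dedup_tb:,.0f} TB (ratio ~{raw_tb / dedup_tb:.0f}:1)")
```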
Data&StorageAsean: Why do the use cases we see for deduplication seem to be limited to backup appliances and all flash arrays?
Bill Andrews: Deduplication in primary storage does not do any better than compression, as each file is a single copy, and the time required to deduplicate the data and then rehydrate it for each user request is too slow, so deduplication is not used in traditional primary storage. However, it is used for flash/SSD for two reasons: first, flash/SSD is fast enough to deduplicate and rehydrate; and second, since SSD is about 8 to 10 times the price of standard disk, deduplication is needed to reduce the price delta between the two. Deduplication is mostly used in backup because with backup you keep weekly, monthly, and yearly versions, typically between 10 and 100 copies. Each copy is not that different from the previous copy, so deduplication has a massive impact in the backup world.
Data&StorageAsean: Are there different approaches to deduplication and if so what are the benefits and downsides of each?
Bill Andrews: There are numerous approaches to deduplication; a simplified chunking sketch follows the list below.
a. Fixed-length block: 64KB, 128KB, 256KB, etc. This approach is used in backup applications, as they have limited CPU.
1. Provides some level of deduplication, from 4:1 to 8:1
2. Uses a lot more disk and costs more for storage as retention grows
3. Uses more bandwidth to replicate offsite
4. Is inline, so data is deduplicated on the way to disk, which reduces performance on top of the fact that this approach steals compute from the backup application
5. Restores are slow since all data is stored in deduplicated form, so the data needs to be rehydrated for each request
b. 8KB variable-length block, typically used in scale-up dedicated deduplication appliances
1. Provides 20:1 deduplication
2. Is inline, so data is deduplicated on the way to disk, which reduces performance on top of the fact that this approach steals compute away from the backup application
3. Restores are slow since all data is stored in deduplicated form, so the data needs to be rehydrated for each request
c. Zone level with byte level compare – used in scale-out dedicated deduplication appliances that include a front-end disk cache
1. Provides 20:1 deduplication
2. Is at least 3 times faster for backups as data is written directly to disk and then deduplicated in parallel with the backup process, but offline
3. The most recent data is kept in an undeduplicated form, so restores and VM boots are 20X faster
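To illustrate why block boundaries matter in approaches (a) and (b), here is a toy Python sketch of fixed-length versus content-defined (variable-length) chunking. The cut-point rule, chunk sizes, and test data are simplified stand-ins and do not represent any vendor's actual algorithm; the point is only that a single inserted byte defeats fixed boundaries but not content-defined ones.

```python
import hashlib
import random

def fixed_chunks(data: bytes, size: int = 64 * 1024):
    """Fixed-length blocks (approach a): simple, but one inserted byte shifts
    every later block boundary, so previously stored blocks stop matching."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, avg_bits: int = 13, window: int = 16):
    """Variable-length blocks (approach b, ~8KB average): cut points are chosen
    from the content itself, so an insertion only disturbs nearby chunks.
    The cut-point rule here is a toy stand-in for a real rolling hash."""
    chunks, start = [], 0
    mask = (1 << avg_bits) - 1
    for i in range(window, len(data)):
        digest = hashlib.sha1(data[i - window:i]).digest()
        if int.from_bytes(digest[:4], "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def stored_bytes(chunks):
    """Bytes actually kept once identical chunks are stored only once."""
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks}
    return sum(unique.values())

rng = random.Random(0)
backup1 = rng.randbytes(256 * 1024)   # ~256KB "full backup"
backup2 = b"x" + backup1              # next backup: one byte inserted at the front

for name, chunker in (("fixed   ", fixed_chunks), ("variable", variable_chunks)):
    total = stored_bytes(chunker(backup1) + chunker(backup2))
    print(f"{name} blocks: {total:,} bytes stored for two ~256KB backups")
```

With fixed blocks, almost nothing in the second backup matches the first, so roughly both copies are stored; with content-defined blocks, only the first chunk changes and the rest deduplicate away.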
Data&StorageAsean: Is deduplication technology relevant as companies virtualise and cloud enable?
Bill Andrews: Deduplication is always relevant when you have multiple copies of very similar data over time, which is exactly what backup data is, since versioned or historical copies are kept. Deduplication saves on storage and bandwidth regardless of whether the data lives in a private, hybrid, or public cloud. Storage is storage and bandwidth is bandwidth, and if you can use less of each, you will always save money.
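As a rough illustration of the bandwidth point, the following sketch reuses the article's figures (a 100TB weekly backup, roughly 2% change per week); the 100 Mb/s link speed is an assumed value chosen only to make the comparison concrete.

```python
# Illustrative WAN arithmetic; the link speed is an assumption.
full_backup_tb = 100     # weekly full backup (from the example above)
weekly_change = 0.02     # ~2% of data changes week to week
link_mbps = 100          # assumed offsite replication link

def transfer_days(tb: float, mbps: float) -> float:
    """Days needed to move `tb` terabytes over an `mbps` megabit-per-second link."""
    bits = tb * 1e12 * 8
    return bits / (mbps * 1e6) / 86_400

print(f"Replicating the full copy:        {transfer_days(full_backup_tb, link_mbps):6.1f} days")
print(f"Replicating deduplicated changes: {transfer_days(full_backup_tb * weekly_change, link_mbps):6.1f} days")
```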
Data&StorageAsean: Are there any unique features you would like to share about your own deduplication offerings?
High Performance Hyper-converged Secondary Storage for Backup with Data Deduplication
ExaGrid’s disk backup with deduplication system is the only solution purpose-built for backup that leverages a unique architecture optimized for scalability, performance, and price. The system scales as needed by adding ExaGrid appliances, which virtualize into a single scale-out system automatically, adding capacity and processing power while acting and being managed as one unified system.
Fastest Backups for the Shortest Backup Window
ExaGrid provides a unique disk landing zone in each appliance where backups are written directly to disk so that the compute-intensive data deduplication process doesn't impact ingest. This approach provides the fastest backup ingest rate of any deduplication solution. ExaGrid uses "adaptive" deduplication to deduplicate and replicate data to the disaster recovery (DR) site during the backup window (in parallel with the backups) rather than inline between the backup application and the disk. This unique combination of a landing zone with adaptive deduplication provides the fastest backup performance, resulting in the shortest backup window as well as a strong recovery point objective (RPO).
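As a rough model of the landing-zone idea described above, the following Python sketch writes backups straight to a landing zone and deduplicates them on a separate background thread; the data structures, chunking, and queue-based hand-off are illustrative assumptions, not ExaGrid's implementation.

```python
# Minimal sketch: ingest lands on disk at full speed; deduplication runs in
# parallel, outside the ingest path. Names and structures are illustrative.
import hashlib, queue, threading

CHUNK = 64 * 1024
landing_zone = {}        # most recent backups kept in full, native form
repository = {}          # deduplicated chunk store: hash -> chunk
manifests = {}           # backup name -> ordered list of chunk hashes
to_dedup = queue.Queue()

def ingest(name: str, data: bytes):
    """Ingest path: write straight to the landing zone and return immediately."""
    landing_zone[name] = data
    to_dedup.put(name)            # deduplication happens later, in parallel

def dedup_worker():
    """Background path: chunk landed backups and keep only unseen chunks."""
    while True:
        name = to_dedup.get()
        data = landing_zone[name]
        hashes = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            h = hashlib.sha256(chunk).hexdigest()
            repository.setdefault(h, chunk)   # store each unique chunk once
            hashes.append(h)
        manifests[name] = hashes
        to_dedup.task_done()

threading.Thread(target=dedup_worker, daemon=True).start()

ingest("week1.bak", b"A" * CHUNK * 4)
ingest("week2.bak", b"A" * CHUNK * 3 + b"B" * CHUNK)   # mostly unchanged data
to_dedup.join()
print(f"Landing zone holds {len(landing_zone)} full backups; "
      f"repository holds {len(repository)} unique chunks for 8 chunks written")
```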
Fastest Restores, VM Boots, and Offsite Tape Copies
Since ExaGrid writes directly to a disk landing zone, the most recent backups are kept in their full undeduplicated, native form. All restores, VM boots, and offsite tape copies are fast as the overhead of the data rehydration process is avoided. As an example, ExaGrid can provide the data for a VM boot in seconds to single-digit minutes versus hours for inline data deduplication backup storage appliances that only store deduplicated data. ExaGrid maintains all long-term retention (weeks, months, years) in a deduplicated format for storage efficiency.
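Under the same simplified model, a restore can be sketched as two paths: a direct read when the backup is still in the landing zone, and rehydration by reassembling chunks from the deduplicated repository otherwise. The structures and names below are illustrative only.

```python
# Toy restore path: fast path from the landing zone, slow path via rehydration.
import hashlib

CHUNK = 64 * 1024
landing_zone = {"week18.bak": b"A" * CHUNK * 4}             # newest backup, native form
repository = {hashlib.sha256(b"A" * CHUNK).hexdigest(): b"A" * CHUNK}
manifests = {"week01.bak": [hashlib.sha256(b"A" * CHUNK).hexdigest()] * 4}

def restore(name: str) -> bytes:
    if name in landing_zone:                  # fast path: no rehydration needed
        return landing_zone[name]
    # slow path: rebuild the backup chunk by chunk from the repository
    return b"".join(repository[h] for h in manifests[name])

assert restore("week18.bak") == restore("week01.bak")       # same data, two paths
```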
Fixed-Length Backup Window
Since data deduplication uses a lot of processor and memory resources, the amount of deduplication work grows as data grows. First-generation deduplication storage appliances use a "scale-up" approach with a fixed-resource front-end controller and disk shelves; as data grows, they only add storage capacity. Because the processor and memory are fixed, the time it takes to deduplicate the growing data also grows, until the backup window is so long that the front-end controller has to be upgraded to a larger, faster controller (a disruptive and costly "forklift" upgrade). ExaGrid instead provides full appliances in a scale-out system. Each appliance has landing zone storage, deduplicated repository storage, processor, memory, and network ports. As data volumes double, triple, etc., ExaGrid doubles, triples, etc. all required resources to maintain a fixed-length backup window. If the backups take six hours at 100TB, they take six hours at 300TB, 500TB, 800TB, etc. Expensive forklift upgrades are avoided, and the aggravation of chasing a growing backup window is eliminated.
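The backup-window point can be shown with back-of-the-envelope arithmetic. The 100TB starting point and six-hour window come from the text; the implied per-appliance throughput and the assumption that ingest scales linearly with appliance count are simplifications for illustration.

```python
# Illustrative backup-window arithmetic; throughput and linear scaling are assumptions.
def backup_window_hours(data_tb: float, appliances: int, tb_per_hour_each: float) -> float:
    return data_tb / (appliances * tb_per_hour_each)

per_appliance = 100 / 6        # ~16.7 TB/hr, so 100 TB finishes in six hours

for data_tb in (100, 300, 500, 800):
    scale_up = backup_window_hours(data_tb, 1, per_appliance)               # fixed controller
    scale_out = backup_window_hours(data_tb, data_tb // 100, per_appliance) # add appliances
    print(f"{data_tb:>4} TB: scale-up {scale_up:4.1f} h, scale-out {scale_out:4.1f} h")
```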