Storage Soup - A SearchStorage.com blog

A data storage blog offering commentary on the storage industry, as well as a behind-the-scenes look at developments in storage management, SAN, NAS, backup, disaster recovery and storage strategy.

Protecting millions of small files

Every week, I visit IT professionals and I often hear the same complaint about dealing with a file server environment that has grown out of control. The problem is that these file servers have millions of small files and customers are looking for ways to better protect this file data.

Disk-based archiving fixes areas of the backup that most disk-to-disk (D2D) solutions do not. Customers are highly frustrated with backup applications stumbling over what I call the “millions of small files issue.” This is primarily caused by the never-ending growth of a standard file server’s data, and most backup applications struggle with it. Customers are counting on D2D to help, and it will… a little. The target disk may be faster, but mostly it is much more forgiving than tape. Tape needs to stream, that is, be fed a constant flow of data, in order to reach maximum write performance. Millions of small files make it difficult to feed those tape drives consistently. Disk backup, on the other hand, will maintain the same write performance no matter how inconsistent the data feed is.

That solves half the backup problem. The other half of the performance problem is that the backup software still needs to walk those millions of small files to identify which ones need to be backed up. This file system walk can be very time consuming. Then, the backup software needs to update its own database that tracks which files were backed up and where. Imagine adding millions of records to a database every night, as fast as possible. That database gets huge in a hurry, can easily be corrupted and, even if everything goes right, is very time consuming to update. Lastly, with most D2D backup solutions you still need to send the entire data load across the network. Even with deduplication solutions, the entire data payload needs to get to the appliance before deduplication happens. All of this consumes network bandwidth. Disk-based archiving may circumvent or delay the need to upgrade network bandwidth by clearing this old data out of the way.
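To make the cost of the file system walk concrete, here is a minimal Python sketch of what an incremental backup has to do before it copies a single byte. It is an illustration only, not how any particular backup product works: the function name `files_changed_since` and the use of modification time as the change test are assumptions made for this example.

```python
import os


def files_changed_since(root, last_backup_time):
    """Walk `root` and return (changed_paths, files_scanned).

    `last_backup_time` is seconds since the epoch. Every file must be
    stat()'d, even though only a few have usually changed.
    """
    changed = []
    scanned = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            scanned += 1
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime > last_backup_time:
                changed.append(path)
    return changed, scanned
```

The point of returning `scanned` alongside the changed list is that the work is proportional to the total number of files, not to the number that changed. Moving old files to an archive shrinks the first number.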

Disk-based archiving eliminates the problem of moving most of these millions of files. With disk-based archiving, the “old” files are stored on the archive and no longer need to be backed up. They are safer on disk than they are on tape (thanks to data integrity checking and replication) and they are out of the way. The backup software no longer needs to walk those files to find which ones need to be protected or send them across the wire to be backed up, and they no longer consume disk space on the file server or the D2D backup target. Additionally, since the archive is disk and not tape, you can be more aggressive with what is archived.

With a classic tape-based archive, customers will wait for data to get very old before moving it to tape. In addition, they will invest in elaborate data movers to provide transparent access to tape. Lastly, data that has stopped changing but is still being referenced or viewed cannot move to tape at all. With a disk-based archive, the delivery back to the user is relatively fast, so you can be more aggressive with your move to archive disk storage and there is less of a need to build elaborate access schemes. Most disk-based archives simply show up as a share on the network and you can archive reference data, further eliminating the data that needs to be protected by traditional backup methods.
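How aggressive you are comes down to a single threshold: how long a file must go unchanged before it moves to the archive. The sketch below is one hypothetical way to list candidates. The function name `archive_candidates`, the `min_idle_days` parameter, and the choice of modification time as the test for "stopped changing" are all assumptions for illustration; a real archive product would apply its own policy.

```python
import os
import time

SECONDS_PER_DAY = 86400


def archive_candidates(root, min_idle_days, now=None):
    """Yield (path, size_in_bytes) for files unchanged for `min_idle_days`."""
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that disappear mid-scan
            if st.st_mtime < cutoff:
                yield path, st.st_size
```

With tape you might set `min_idle_days` to several years. With a disk-based archive that shows up as a network share, a much smaller value becomes practical because reference data stays quickly accessible.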

A disk-based archive is the perfect complement to D2D backup. It will reduce the investment in disk needed for backup, and an archive strategy may pay for itself on this reduction alone. This is because a disk-based archive clears out the fixed data (data that has stopped changing). That makes the software modules required by most backup applications for D2D cheaper, since vendors charge based on stored capacity, and it reduces the disk capacity needed on the disk backup target as well as on the primary (expensive) disk on the file server.

What does this look like in hard cost savings? Disk-based archiving can reduce primary storage requirements (at least a 10X dollar saving: $4 vs. $43/GB) and it can reduce backup requirements (fixed information is said to occupy, on average, 50% of most enterprise primary disk capacity), saving an additional $6/GB.
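As a rough worked example, the figures above can be plugged into a few lines of Python. The per-GB prices and the 50% fixed-data fraction come from the paragraph above. Applying the $6/GB backup saving to every archived gigabyte is my reading of those numbers, so treat it as an assumption, and note that the function name `archive_savings` is made up for this sketch.

```python
PRIMARY_COST_PER_GB = 43.0   # primary disk, from the figures above
ARCHIVE_COST_PER_GB = 4.0    # archive disk, from the figures above
BACKUP_SAVING_PER_GB = 6.0   # assumed to apply per archived GB
FIXED_FRACTION = 0.5         # share of primary data that has stopped changing


def archive_savings(primary_gb, fixed_fraction=FIXED_FRACTION):
    """Estimate dollar savings from moving fixed data to a disk archive."""
    archived_gb = primary_gb * fixed_fraction
    primary_saving = archived_gb * (PRIMARY_COST_PER_GB - ARCHIVE_COST_PER_GB)
    backup_saving = archived_gb * BACKUP_SAVING_PER_GB
    return {
        "archived_gb": archived_gb,
        "primary_saving": primary_saving,
        "backup_saving": backup_saving,
        "total_saving": primary_saving + backup_saving,
    }
```

For a 10,000 GB file server, this estimate archives 5,000 GB, saving $195,000 on primary disk and $30,000 on backup, for $225,000 in total. Your own prices and fixed-data fraction will change the result, which is why they are parameters rather than hard-wired.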

For more information please email me at georgeacrump@mac.com or visit the Storage Switzerland Web site at: http://web.mac.com/georgeacrump.

The Linux effect on storage

Linux is currently used in about 20% of the medium to large sized data centers, and according to some reports, it will be in some 33% of data centers before the end of the year. By 2011, it is expected that most data centers will have at least half of their environment running some flavor of the Linux OS. As this platform really begins to settle in, it is important to consider the ramifications that it will have on storage, data protection and disaster recovery.

When I look at how a supplier handles coverage of a platform, I compare it to the games checkers and chess. When a supplier has “checker coverage”, that means they have just enough support of the platform to be able to get a check mark. When I say they have “chess coverage”, that means they have deep coverage, including specific databases that are popular on the platform.

Looking at the foundation of data protection, backup software is a good place to start. Most of the major suppliers certainly have “checkers” type coverage of the Linux environment. Most have Red Hat and maybe SUSE variants covered, but some still only support Linux as a client, meaning that the Linux servers cannot have locally attached tape. As your Linux environment grows, this can be a real problem. A handful of the backup software suppliers have also ported over their Oracle hot backup modules, and while Oracle on Linux is significant and growing, the MySQL install base seems to be growing faster. And, while in the past the typical MySQL data set was not nearly as large as the typical Oracle data set, it seems to be catching up there as well. A little farther behind is PostgreSQL, but it still has a significant install base and it too seems to be growing. So, it is important that your backup application supports more than just Oracle and can do more than just hot database backup; for example, it should be granular to the tablespace level to help with faster backups and recoveries.

There are backup applications that support Linux completely, and there is no longer a need to sacrifice. This may mean supporting two backup applications in the enterprise: one for Windows and one for Linux. But, as I have said in past articles, while that is not ideal, it is not unacceptable, especially if it means you significantly improve your level of data protection on the second platform. You may find that your new product provides support that is as good as or even better than your original one.

When looking at core storage the situation is equally interesting. For block-based storage or SAN storage, basic support or “checker coverage” seems to be there across the board. Most of the SAN vendors support Fibre Channel attachment of Linux servers to their SAN storage, and support for iSCSI connections is growing. There is not much support beyond this basic connectivity though. For example, there is limited support for boot from SAN.

Interestingly, when it comes to SAN-based storage, the manufacturers have created modules for specific applications that allow their SAN arrays to better interact with them. For example, they might have a module for Exchange that will quiesce the Exchange environment, take a clean snapshot and then mount that snapshot to a backup server for off-host backup. Despite the growth of the Linux install base, and especially the growth of MySQL and PostgreSQL in that environment, we have not seen many specific tools to protect these increasingly critical applications. You can write scripts to accomplish the above, and in many cases you currently have to. But it would be better to have this integrated into the storage solution, so you can avoid all the issues that surround homegrown scripts.

With Linux and NAS-based storage, you have to be equally careful. Linux file systems follow Unix conventions, so working with a Windows Storage Server based NAS can often be problematic. In all fairness, a Linux-based NAS often has problems with Windows clients. There are two options here. You could focus on the Tier 1 NAS providers that have the Unix and Windows file system differences mostly resolved. This has challenges in cost, but provides comfort and reliability. Another option is to use a virtualized network file management tool. With a network file management product you can have both a Windows NAS and a Linux NAS and have data directed to the appropriate NAS based on data type, allowing for seamless support of both file systems. Of course, a network file management product delivers far more than this. For example, it can enable a migration of data as it ages to a disk-based archive, or it can help with migration to a new NAS platform altogether.

Disaster recovery is another point of consideration. If you are replicating at the SAN level, then the SAN storage controller itself can cover most of this. But if not all of your Linux data is on a SAN, then you may have issues with replication of disaster recovery data. Among the available replication software applications, there are some very Linux-focused applications but not many that can cover the enterprise. Replication is an area where you don’t want to have too many different tools to monitor and manage. Focus on finding a solid multi-platform tool that can replicate Linux, Unix and Windows data.

Linux is going to be increasingly important in enterprises of all sizes, and it seems that the traditional market leaders in storage are going to ignore the platform or give it just “checker” type coverage. The new players on the market are taking advantage of this and are moving quickly to fill the void. It is interesting to note that most of the manufacturers that have a strong Linux solution also have an equally strong Windows and Unix solution. So, in providing only the most basic support for Linux, the market leaders may end up ceding the entire enterprise.

Users: We need remote office data protection

One of the common questions I am getting from the IT people I meet with is how they can protect remote offices, typically those with no local IT staff (at least not officially). It is important to accept upfront that you may need a couple of solutions to address this challenge. This is especially true if you have remote offices that vary in size. There may be local databases and, more often than not, local file servers.

The first option is to eliminate the problem by eliminating the local need. Products like Citrix or Windows Terminal Server can eliminate the need for local applications, and wide area file services (WAFS) can eliminate the need for local file servers. The Citrix and Windows Terminal Server solutions are best explained by those two companies, so we’ll spend our time on WAFS. WAFS essentially places a cache at the remote site. At a high level, this cache is a server appliance with a small local disk that can replicate changes as they happen to a central server at a primary data center. The most frequently accessed data is stored on the remote appliance, which serves up data to those users in a local fashion and at local performance. Typically, proprietary, enhanced network protocols are used to get higher performance on transfers of data that is not in the remote cache. Some of the WAFS companies are also providing the ability to use data deduplication on the network, in a similar fashion to how some disk-to-disk backup products use data deduplication to optimize storage. A NAS that can do data deduplication would be an ideal central NAS for this environment, since there is typically a high level of file redundancy between remote offices and the primary data center.

From a data protection standpoint, the centralized repository for all the remote cache’s data is now also a server at your primary data center and can fall under the umbrella of your normal protection scheme. Other advantages of WAFS are that it can eliminate the need to buy additional servers and storage for remote offices, delay the need for bandwidth upgrades and even enable better collaboration between offices.
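The caching behavior described above can be sketched in a few lines of Python. This is a toy model and not any vendor’s design: the class name `RemoteOfficeCache`, the least-recently-used eviction policy, and the plain dictionary standing in for the central server are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict


class RemoteOfficeCache:
    """Toy remote-site cache in front of a central file store."""

    def __init__(self, central_store, capacity):
        self.central = central_store   # dict-like stand-in for the data center
        self.capacity = capacity       # max number of files held locally
        self.cache = OrderedDict()     # oldest entry first
        self.wan_reads = 0             # reads that had to cross the WAN

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)   # mark as recently used
            return self.cache[name]
        self.wan_reads += 1                # cache miss: fetch from central
        data = self.central[name]
        self._store(name, data)
        return data

    def write(self, name, data):
        self.central[name] = data          # push the change to central at once
        self._store(name, data)

    def _store(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
```

Because every write lands in the central store immediately, the authoritative copy lives at the primary data center, where the normal backup scheme can protect it. The local cache only decides which reads are fast.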

Another option is to use replication. This is ideal for sites that, in addition to a remote file server, also have some remote database applications or email, especially if there are just a few of these servers present. While most data replication products are considered disaster recovery solutions, they also make for an ideal remote or branch office backup solution. With these products, all data is replicated as it changes to a centralized disk at the primary site. This disk can then be mounted to a backup server and backed up locally at the primary data center. Cost can be a concern if the local server count is more than just a few servers and you are not leveraging the other value points (disaster recovery and failover) of replication. Also, unless your data replication solution can produce snapshots to freeze moments in time, or you can leverage snapshot technology by replicating to storage that supports it at the primary data center, be aware that if you do experience corruption at the remote site, you will very quickly replicate that corruption to the primary site.

If you don’t need the near-instant protection of WAFS or replication, you can also leverage some D2D solutions to perform a backup locally and use the D2D solution’s replication capability to send that data to a similar device at the primary data center. This typically makes sense when you have a fairly large data set or number of servers and maybe a small remote IT staff. It will almost certainly require a D2D solution that can leverage data deduplication, as straight disk-to-disk backup will not work well for remote backups. Standard backup to disk essentially consists of several large backup files. Those files are typically created too fast and are too large to be copied across the common WAN segment in time to meet the nightly backup window. Data deduplication appliances, on the other hand, only have to send the blocks that changed between backup jobs, greatly reducing the WAN requirement. Also, you now have the data locally at the branch office instead of only at the primary data center, allowing for faster recovery at the branch if needed.
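The idea of sending only changed blocks can be illustrated with a short Python sketch. It assumes fixed-size blocks identified by a SHA-256 hash, which is a simplification picked for clarity; real appliances may chunk and index data quite differently. The names `split_blocks` and `blocks_to_send` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # bytes per block; an arbitrary choice for this sketch


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def blocks_to_send(data, known_hashes, block_size=BLOCK_SIZE):
    """Return only the blocks whose hashes the target has not seen.

    `known_hashes` is a set of hex digests already stored at the primary
    site. It is updated in place so the next backup job can reuse it.
    """
    to_send = []
    for block in split_blocks(data, block_size):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)
            to_send.append(block)
    return to_send
```

On the first backup every unique block crosses the WAN. On the next night, a backup file that differs in one block sends just that block, which is what makes the nightly window achievable over a typical branch office link.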

In the past four or five years, we have expanded from hardly any options for remote office data protection to many. Which of these solutions you deploy is a function of budget and business requirement, and in some cases it may make sense to blend the solutions. Assessing the needs of the remote offices while focusing on the business realities at the primary data center will help you make those choices.
