Jul 23, 2012


ZFS – The Zettabyte File System

Although ZFS, the Zettabyte File System, is not a new technology, it has many admirable features, and it is an open source file system.

Sun Microsystems began developing ZFS in 2001 and announced it to the IT industry in 2004. It was integrated into Solaris in 2005. In 2007 it was ported to FreeBSD by Pawel Jakub Dawidek. FreeBSD 8.2 ships ZFS pool version 15, while FreeBSD 9.0 ships version 28. In short, work on the FreeBSD implementation is still continuing.

ZFS can be used on Oracle Solaris, FreeBSD, NetBSD, FreeNAS, Mac OS X and operating systems based on the Linux kernel. ZFS is distributed under the CDDL (Common Development and Distribution License), a license written by Sun Microsystems, so a few more steps are needed to use it on Linux. Because the CDDL is incompatible with the GPL (General Public License), ZFS code cannot be merged into the Linux kernel without violating the GPL. For that reason it was not available on Linux by default and was first made usable in user space through FUSE. As for the current situation, using ZFS on Linux is as easy as installing a package: it can be added to the system as a kernel module, just like any other package. For more detailed information, see resources such as “ZFS on Linux” or “ZFS-FUSE”. Which of the two you use on Linux is entirely up to you.

Jeff Bonwick, the project leader and in a sense the father of ZFS, described it as “The Last Word in File Systems”. He has a point, because ZFS is arguably the most advanced file system of today. ZFS amazed me when I first read about it, and I believe it will amaze you too.

So what makes ZFS so cool?

Features of ZFS

128-bit file system

Support for up to 256 quadrillion zettabytes of storage

The storage pool concept for the management of physical disks

File system size can expand dynamically as disk storage is added.

File systems can be spread across multiple disks without any hard disk limitations.

Pools and file systems are mounted automatically. There is no need to add entries to /etc/fstab (/etc/vfstab) anymore.

Easy administration

Transactional file system

None of the physical limitations that other file systems have

Self-healing, a feature very few file systems offer

Data integrity through checksums, which removes the need for the fsck tool

Volumes concept

RAID-0/1 (striping & mirroring), RAID-Z (enhanced RAID-5 and RAID-6)

High availability

Reducing cost of hardware

Deduplication

Customizable settings with properties of ZFS

Snapshots

Clones

ZFS Send/Receive

Dynamic Resilvering

Dynamic Scrubbing

Data Migration

Cache device support

SAN support

Hardware Requirements of ZFS

64 bit dual core/multi-core processor

A minimum of 1 GB of RAM. For high performance, 4 GB or more is recommended.

A minimum disk size of 128 MB

 

Scalability

What is a zettabyte? A zettabyte is a unit of measurement for data size, like the megabyte, gigabyte, and so on. The units relate to each other as shown below.

Kilobyte → Megabyte → Gigabyte → Terabyte → Petabyte → Exabyte → Zettabyte

1 Zettabyte = 1024 Exabytes

1 Zettabyte = 1 billion TB

With support for 256 quadrillion zettabytes, ZFS could address all the digital data in the world, because ZFS is a 128-bit file system. Compared with 64-bit file systems, ZFS can address about 16 billion billion (2^64) times more data. Today no system needs anywhere near that much space. Fully populating such a storage pool would require more energy than it takes to boil the oceans. If you don’t believe this, you can check out the article that gives the theoretical explanation.
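The figures in this section can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation; it assumes binary units (1 zettabyte = 2^70 bytes, matching "1 Zettabyte = 1024 Exabyte"), and the variable names are my own.

```python
# Back-of-the-envelope check of the address-space figures in this section.
# Assumption: binary units, as in "1 Zettabyte = 1024 Exabyte".
EXABYTE = 2**60
ZETTABYTE = 2**70

zfs_bytes = 2**128      # what a 128-bit file system can address
legacy_bytes = 2**64    # what a 64-bit file system can address

ratio = zfs_bytes // legacy_bytes          # how many times larger
pool_limit_zb = zfs_bytes // ZETTABYTE     # pool limit in zettabytes
max_file_eb = legacy_bytes // EXABYTE      # 64-bit file size limit in exabytes

print(f"128-bit vs 64-bit: about {ratio:.2e} times larger")
print(f"pool limit: {pool_limit_zb // 2**50} x 2**50 zettabytes")
print(f"largest file: {max_file_eb} exabytes")
```

Here 2^50 plays the role of "quadrillion" and 2^60 the role of "billion billion", which is where the rounded figures of 256 quadrillion zettabytes and 16 billion billion come from.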

These limits are theoretical; no single system can reach such data sizes in practice. There is no restriction on the number of directories or files. In addition, ZFS has no fixed, preallocated inode table: metadata is stored in dynamically allocated blocks, like any other data. In short, ZFS never runs out of inodes.

File systems can span multiple disks. In UFS and other traditional file systems, a file system cannot span disks on its own. A single file can be up to 16 exabytes in size, which in practice means there is no per-file limit, because no file of that size exists in the world.

 

Storage Pool Concept

The storage pool concept is used to manage physical disks. A storage pool defines which physical disks it contains and how they are arranged, for example as a “mirror” or a “backup” device. In other words, it describes the purpose each physical disk serves. ZFS also supports volumes.

The storage pool concept is similar to the volume concept. So what is a volume, and why is it used?

First, we have to understand the volume concept. Before volume managers came to the industry, traditional file systems were built as in the figure below: each file system was tied to a single disk.

The IT sector faced growing needs and wanted a solution that was easier to administer, so developers brought volume managers into our lives. At first, volume managers were sold as separate products, and they could not overcome some problems, so they were never a complete solution.

The Comparison of ZFS’ Volume Model and Traditional Volume Model

In the traditional volume model, each file system must have its own partition or volume. When a file system needs to grow, you have to do this operation manually. Each file system has its own limits and is separated from the others.

In ZFS pooled storage there are no partitions to manage. When you want to shrink or enlarge a file system, you can do so easily with a single command. All file systems share the space in the storage pool.

Easy Administration

ZFS is controlled by only two commands (zpool and zfs). You can create large file systems with one command. With another command you can adjust the settings of a file system or storage pool to suit your purpose or needs.

With a single command you can create large file systems on mirrored or RAID-Z disk layouts. In addition, the easy administration of ZFS means less time is spent rebuilding disks after a failure.

We can create a storage pool that is ready to use by typing only one command.

# zpool create vizyonargepool /dev/da0

As soon as you create a storage pool, it is mounted into the root file system hierarchy automatically, without using the traditional mount command or adding an entry to /etc/fstab to make it permanent.

The mount point has the same name as the storage pool. For example, we created a storage pool named “vizyonargepool” above. This storage pool is mounted automatically at /vizyonargepool in the root file system hierarchy. We can change the mount point of the storage pool with the zfs set command, as shown below.

# zfs set mountpoint=/opt/vizyonarge  vizyonargepool

You can create file systems under other file systems, forming parent and child hierarchies within the storage pool. A new file system created under a parent inherits the parent’s properties. Many properties can be inherited this way, such as compression, checksum and atime. A few, such as quota and reservation, apply only to the file system on which they are set.
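To make the inheritance idea concrete, here is a minimal sketch of how a child file system could pick up a property from its nearest ancestor. It is a toy model, not how ZFS is implemented; the `Dataset` class and the dataset names under vizyonargepool are hypothetical.

```python
class Dataset:
    """Toy stand-in for a ZFS file system; not the real implementation."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.local = {}  # properties set directly on this dataset

    def set(self, prop, value):
        self.local[prop] = value

    def get(self, prop, default=None):
        # Walk up the parent chain until some ancestor defines the property.
        node = self
        while node is not None:
            if prop in node.local:
                return node.local[prop]
            node = node.parent
        return default


pool = Dataset("vizyonargepool")
home = Dataset("vizyonargepool/home", parent=pool)
user = Dataset("vizyonargepool/home/user", parent=home)

pool.set("compression", "on")     # set once at the top
home.set("atime", "off")          # set in the middle of the tree
user.set("compression", "gzip")   # a local value overrides the inherited one

print(home.get("compression"))    # inherited from the pool
print(user.get("compression"))    # local override
print(user.get("atime"))          # inherited from the parent
```

The point of the sketch is only the lookup order: a value set locally wins, otherwise the nearest ancestor that sets the property decides.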

 

Transactional Semantics

ZFS is a transactional file system, which means the state of the file system on disk is always consistent.

Traditional file systems work differently. For example, suppose you copy an 80 MB file to a disk formatted with UFS. As the copy proceeds, UFS allocates data blocks and links them to the file. If the system fails during this process, for example because of a power outage or a hardware error, the file system is left in an inconsistent state.

To solve this problem, as system administrators we use the “fsck” tool to fix inconsistencies in the file system. The purpose of the tool is to repair corruption and bring the system back to a usable state. However, fsck may not be able to repair all corruption. If a critical system that has to run 24/7 fails, the data may be corrupted and fsck has to be run manually. If the physical disks are very large, fsck takes a long time, and the critical machine stays down for all of it.

Some file systems use “journaling”. In other words, every operation on the file system is logged to a journal. After a power outage or hardware failure, the most recent changes can be read back from the journal. However, this model is not a complete solution either.

In a transactional file system, data is managed with the “copy on write” model. Existing data is never overwritten in place. New data is written to free blocks, and the change becomes visible only once the write has completely finished. If a write is interrupted by an unexpected failure, ZFS ignores the partially written data. Whatever failure occurs, the data on disk is always in a consistent state, so you never have to run fsck after a failure.

Sun Microsystems puts it like this: once data has been written to disk, it will never be lost, and it will always be on the disk. An example makes this clearer.

Suppose we have a 100 MB file. If an error occurs and the system fails after 75 MB of the file has been copied to disk, there is no need to run fsck when the system starts again.
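The interrupted-copy example can be sketched in a few lines. This is a deliberately simplified model of copy on write, assuming a single root pointer that is switched only after all new blocks are written; the `CowStore` class and its `crash_after` parameter are invented for illustration and do not correspond to real ZFS code.

```python
class CowStore:
    """Toy copy-on-write store: new data goes to fresh blocks, and one
    pointer update at the end makes it visible. Not real ZFS code."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.root = ()        # tuple of block ids forming the live file
        self.next_id = 0

    def _alloc(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write_file(self, chunks, crash_after=None):
        new_ids = []
        for i, chunk in enumerate(chunks):
            if crash_after is not None and i == crash_after:
                # Simulated power loss: the root was never updated, so the
                # partially written blocks are simply unreferenced.
                return False
            new_ids.append(self._alloc(chunk))
        self.root = tuple(new_ids)   # the single "commit" step
        return True

    def read_file(self):
        return b"".join(self.blocks[i] for i in self.root)


store = CowStore()
store.write_file([b"old-", b"data"])

# Three of four chunks are written (think 75 MB of 100 MB), then the crash.
ok = store.write_file([b"new-", b"half", b"done", b"file"], crash_after=3)
print(ok, store.read_file())
```

After the simulated crash the store still returns the old, complete contents, because nothing that was already referenced has been overwritten.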

In other file systems, only the consistency of the metadata can be guaranteed.

Copy on write is a detailed topic; I will discuss it in another article.

 

Checksums and Self-Healing

In ZFS all data is written in blocks, and a checksum is computed for every block. The integrity of the data is verified with a 256-bit checksum.

Other file systems rely on the hardware to detect and correct errors. ZFS, however, verifies the consistency of the data itself with 256-bit checksums. You can choose the checksum algorithm: a simple but fast one such as Fletcher-2, or the slower but more secure SHA-256. You can set the algorithm with a single command.

If ZFS encounters a checksum mismatch in a RAID-Z or mirrored pool, it reads the data from an intact copy and writes the good data over the corrupted block. This is why ZFS is described as a file system that can heal itself without any intervention.

How does the self-healing mechanism work?

To understand how self-healing works, we first have to understand how traditional file systems work.

Mirroring mechanism in traditional file systems

Suppose an application asks to read some data, and the block on the first disk is corrupted. The volume manager passes the bad block up to the file system, which has no way of detecting the error. If the block is metadata, the system panics because of the broken metadata. If it is not metadata, the corrupted data is handed to the application as if nothing were wrong.

How does the self-healing mechanism work in ZFS?

Suppose an application asks to read data from one disk of a mirror. ZFS verifies the checksum of the block, and if the block turns out to be corrupted, ZFS reads the data from the second disk of the mirror. The intact block is returned to the application, and the corrupted block on the first disk is repaired with the intact copy from the second disk. In this way the mirrored disks heal themselves automatically.

Mirror: a virtual device that stores identical copies of data on two or more disks in the storage pool.

Virtual device: a logical device in the storage pool. It may be a single physical device or a group of devices.
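The self-healing read on a mirror can be sketched roughly as follows. This is a toy model under my own assumptions: two in-memory dictionaries stand in for the disks, SHA-256 from Python's hashlib stands in for the block checksum, and the `Mirror` class is hypothetical.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy two-way mirror with per-block checksums; illustrative only."""

    def __init__(self):
        self.disks = [{}, {}]   # each disk: block number -> bytes
        self.sums = {}          # block number -> expected checksum

    def write(self, block_no, data):
        for disk in self.disks:
            disk[block_no] = data
        self.sums[block_no] = checksum(data)

    def read(self, block_no):
        expected = self.sums[block_no]
        good = None
        bad = []
        for disk in self.disks:
            data = disk[block_no]
            if checksum(data) != expected:
                bad.append(disk)        # this copy is damaged
            elif good is None:
                good = data             # first intact copy found
        if good is None:
            raise IOError("no intact copy of block %d" % block_no)
        for disk in bad:
            disk[block_no] = good       # repair the damaged copy
        return good


mirror = Mirror()
mirror.write(7, b"important data")
mirror.disks[0][7] = b"imp0rtant data"   # silent corruption on the first disk

result = mirror.read(7)
print(result, mirror.disks[0][7])
```

The application receives the intact data, and the damaged copy on the first disk is rewritten as a side effect of the read, which is the behaviour described above.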

 

RAID-Z

With ZFS we can use software-based RAID. Thus, there is no need to buy any RAID hardware.

Features of RAID-Z

With RAID-Z we can increase fault tolerance further by specifying single, double or triple parity. Single parity (RAID-Z1) is similar to RAID-5, but it is not exactly RAID-5. Double parity (RAID-Z2) is similar to RAID-6, but it is not exactly RAID-6 either. RAID-Z1 keeps working after one disk fails, and RAID-Z2 keeps working after two disks fail.

RAID-Z always performs full-stripe writes. In other words, the data and parity of a block are written together as one complete stripe. To see why this matters, consider a traditional single-parity RAID-5 configuration. Blocks are striped across the disks, and the data and its parity are written separately. If the system shuts down unexpectedly after the data has been written but before the parity has been updated, then after the OS restarts the parity no longer matches the data. If a disk later fails, reconstruction from this stale parity produces incorrect data.

This error is called the write hole. ZFS prevents it with full-stripe writes, which improves data integrity.
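A small sketch shows both ideas: rebuilding a lost block from parity, and how stale parity (the write hole) leads to wrong data. It uses plain XOR parity as in single-parity RAID; it is not the actual RAID-Z on-disk layout, and the helper names are my own.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    # Data and parity are computed together and stored as one unit,
    # which is the "full stripe write" idea.
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    # XOR of all surviving blocks (data and parity) gives the lost block.
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = write_stripe([b"AAAA", b"BBBB", b"CCCC"])
recovered = rebuild(stripe, lost_index=1)
print(recovered)

# Write hole: the second data block was updated, but the crash happened
# before the parity (last element) was rewritten.
stale = [b"AAAA", b"XXXX", b"CCCC", stripe[3]]
wrong = rebuild(stale, lost_index=0)
print(wrong == b"AAAA")
```

With a consistent stripe the lost block comes back intact. With stale parity the rebuilt block no longer equals the original, which is exactly the silent corruption that full-stripe writes are meant to avoid.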

For example, if we cancel a file copy part-way through, no write hole results: data and parity stay consistent. Without ZFS, on hardware-based RAID, there are two cases. If the controller has a battery-backed cache and the system fails while parity is being written, the hardware RAID can recover.

If the controller has no battery, or the battery runs out of power and we take no action, we can lose data. For the reasons above, RAID-Z may be preferable both for cost and for reliability.

*** RAID-Z can be faster than traditional RAID-5 for writes, because full-stripe writes avoid the read-modify-write cycle.

High Availability

If a disk fails, we can replace it without shutting down the system. ZFS manages RAID at the OS level, whereas hardware RAID is managed through the controller’s BIOS. Moreover, we can ensure the consistency of data on disk without fsck, we get self-healing and data checksums, and we avoid the “write hole” that affects hardware RAID.

ZFS is a life-saving technology on critical systems, and it also reduces costs.

 

Deduplication

Deduplication is a feature that reduces disk usage. As mentioned earlier, ZFS writes data in blocks. Each block is checksummed and stored on disk. If the checksum of a new block matches the checksum of a block written earlier, the block is not written again; instead, a reference to the existing block is recorded.

This saves disk space and can also reduce the amount of I/O.

ZFS deduplication is synchronous: it happens while new data is being written to disk. The command below enables it.

# zfs set dedup=on vizyonargepool

By default the SHA-256 algorithm is used, which has about 10^77 (2^256) possible hash values. We can change the hash algorithm and alternatively use the Fletcher4 algorithm.
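The idea of linking a duplicate block to an existing one can be sketched like this. It is a toy, in-memory model that assumes blocks are identified purely by their SHA-256 digest; the `DedupStore` class and the sample block contents are made up for illustration.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps one copy per unique SHA-256 digest."""

    def __init__(self):
        self.blocks = {}     # digest -> data
        self.refcount = {}   # digest -> number of references

    def write(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blocks:
            self.refcount[digest] += 1      # duplicate: only add a reference
        else:
            self.blocks[digest] = data      # first time this content is seen
            self.refcount[digest] = 1
        return digest

    def stored_bytes(self):
        return sum(len(d) for d in self.blocks.values())


incoming = (b"kernel", b"libc", b"kernel", b"kernel")
store = DedupStore()
refs = [store.write(block) for block in incoming]

logical = sum(len(block) for block in incoming)
print(logical, store.stored_bytes())

# Size of the SHA-256 hash space, the "10^77" figure mentioned above.
print(f"{2**256:.2e}")
```

Four blocks are written but only two unique ones are stored; the other two writes just increase a reference count.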

How much data storage capacity can we save with deduplication?

How much you save depends on what the storage pool is used for.

Virtualization Environments : Deduplication is a life-saving technology when many virtual machines use the same kernel, libraries, system files and applications.

File Servers : It depends on the files stored on the server. If the server hosts web applications, deduplication can save a significant amount of storage.

Mail Servers : Deduplication is also very important for mail servers. On some mail servers the same data is served from several locations, so deduplication saves both I/O and storage space.

Back-up Disks : Some files are backed up over and over again by different users. For example, when User B backs up files that User A already backed up, we see the advantage of deduplication.

Web 2.0 and Social Networks : We can avoid storing the same content twice when it is shared by different users.

 

Customized Settings

After creating a ZFS storage pool, we can change ZFS settings with a single command, “zfs set”. Here are some of the most used settings.

atime : [ on ] Determines whether the access time of a file is updated when it is read. If our files are static, we can disable this property to gain performance.

# zfs set atime=off vizyonargepool

checksum : [ on ] [ fletcher4 ] As mentioned earlier, checksums verify the integrity of the data written to blocks. This property is enabled by default and uses the Fletcher4 algorithm. Alternatively, you can set it to Fletcher2, SHA-256 or SHA-256+MAC.

compression : [ off ] This property is disabled by default. Other values that can be assigned are on, lzjb, gzip and gzip-N.

# zfs set compression=gzip-9 vizyonargepool

** Although compression is a compelling feature, note that compressing data uses CPU time, which adds overhead.

copies : [ 1 ] Determines, per file system, how many copies of the data are stored. Quotas and reservations are affected. This property should be set when the file system is created; if it is changed later, only newly written files are affected.

dedup : [ off ] Stores identical blocks only once. The value can be on, off, verify or sha256. The default algorithm is SHA-256.

encryption : [ off ] Determines whether the data is encrypted. A key is required to access encrypted data. This feature may be used on critical systems.

exec : [ on ] Determines whether programs in the file system may be executed.

mountpoint : [ N/A ] This property is used for changing the mount point.

# zfs set mountpoint=/opt/vizyonarge vizyonargepool

quota : [ none ] Limits the size of the file system. The quota is enforced as a hard limit.

reservation : [ none ] Sets the minimum amount of storage guaranteed to the file system. The reserved space stays available to this file system even if other file systems in the pool consume the rest.

sharenfs : [ off ] Determines whether the file system is shared over NFS.

 

Snapshots

One of the best features of ZFS is that file systems can have snapshots. A snapshot is a read-only copy of the file system. Snapshots are kept in the storage pool and initially consume no extra space. Creating a snapshot is as easy as entering the command below.

# zfs snapshot vizyonargepool@0112

# zfs list

When a block is changed, ZFS first writes the new data to a new block and then releases the old block. But if there is a snapshot of the file system, the old block is not released, which means the old data is still accessible through the snapshot. That is why the space used by a snapshot grows with the number of old blocks it holds.
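The way a snapshot keeps old blocks alive can be modelled in a few lines. This is a simplified sketch under my own assumptions (one block per file, a snapshot is just a saved copy of the block-pointer table); the `SnapFS` class is hypothetical and only mirrors the behaviour described above.

```python
class SnapFS:
    """Toy file system where a snapshot is a saved copy of the
    block-pointer table; the blocks themselves are shared."""

    def __init__(self):
        self.blocks = {}        # block id -> bytes
        self.live = {}          # file name -> block id
        self.snapshots = {}     # snapshot name -> {file name: block id}
        self.next_id = 0

    def write(self, name, data):
        self.blocks[self.next_id] = data     # always write to a new block
        self.live[name] = self.next_id
        self.next_id += 1
        self._release_unreferenced()

    def snapshot(self, snap_name):
        self.snapshots[snap_name] = dict(self.live)

    def rollback(self, snap_name):
        self.live = dict(self.snapshots[snap_name])
        self._release_unreferenced()

    def _release_unreferenced(self):
        # A block is freed only if neither the live file system
        # nor any snapshot still points to it.
        in_use = set(self.live.values())
        for table in self.snapshots.values():
            in_use.update(table.values())
        for block_id in list(self.blocks):
            if block_id not in in_use:
                del self.blocks[block_id]


fs = SnapFS()
fs.write("a.txt", b"v1")
fs.snapshot("0112")               # costs no extra blocks at this point
blocks_at_snapshot = len(fs.blocks)

fs.write("a.txt", b"v2")          # old block is kept because of the snapshot
blocks_after_change = len(fs.blocks)

fs.rollback("0112")
print(blocks_at_snapshot, blocks_after_change, fs.blocks[fs.live["a.txt"]])
```

Taking the snapshot adds no blocks. Space is consumed only once the live file system changes and the snapshot becomes the sole owner of the old block.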

The contents of a snapshot can be found in the hidden .zfs directory under the file system that was snapshotted. If you try to destroy the file system, you will get an error because of the existing snapshot.

You can roll back to a snapshot with the command below.

# zfs rollback vizyonargepool@0112

Clones

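Clones are created from snapshots, but they are file systems in their own right and, unlike snapshots, they are not read-only. A clone must be created in the same storage pool as the snapshot it comes from.

# zfs clone vizyonargepool@0112 vizyonargepool/vizyonargeclone

# zfs get origin vizyonargepool/vizyonargeclone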

ZFS Send/Receive

With the zfs send command we can stream a copy of a snapshot and so carry the snapshot to another system, or to another storage pool on the same system.

# zfs send vizyonargepool@0112 | zfs receive vizyonargereceivedpool

To back up to a remote system we can use the command below.

# zfs send vizyonargepool@0112 | ssh backup@vizyonargepool.com zfs receive vizyonargebackup/vizyonargepool

** When a snapshot is sent to the remote system as a full stream, the destination file system does not need to exist beforehand; zfs receive creates it.

Dynamic Resilvering

Resilvering is the process of copying data onto a new or replaced device in the storage pool. Suppose we attach a third disk to a two-disk mirrored storage pool. The new disk is brought into the mirror by copying the data from the other two disks. This process is also called mirror resynchronization.

The new disk may be ready for use within minutes, but the time depends on the amount of data.

Dynamic Scrubbing

In ZFS we can check the consistency of the data with a scrub, which also repairs any data errors it finds on disk. Scrubbing is started with the zpool command, as shown below.

# zpool scrub vizyonargepool

Cache Device Support

Cache devices cache data from the storage pool. We can add a cache device after the storage pool has been created.

A cache device sits between memory and disk and significantly increases performance, especially for random reads of mostly static content. Using SSDs as cache devices improves performance even further.

Cache devices cannot be mirrored. If a cache device fails, the system keeps running in a stable state and reads from the original storage pool devices instead.

 

Data Migration

When we need to migrate data from one system to another, we can use the zpool export command. For instance, if we move the disks from the origin system to the target system, all we need to do on the target is run the zpool import command. The whole disk layout is then available on the target system just as it was on the origin.

SAN Support

On FreeBSD, a SAN can be used with UFS, but in case of an error the consistency of the data may be lost. If we use ZFS on FreeBSD 8 release versions, we can ensure the consistency of the datasets.

Conclusion

In conclusion, ZFS is open source and available on many operating systems, and I have covered some of its features in this article. Such an innovative file system brings freedom to the IT industry. ZFS will gain recognition over the years, not only because it keeps data stable and consistent but also because new features are continuously being added.

Besides reducing hardware costs, ZFS is a file system that can be used for many purposes, and its scalability and customizability in particular have won it many admirers.

It’s pretty exciting to talk about the new features of ZFS, especially when there is not a substantial amount of native-language material about it.

Mustafa Resul ÇETİNEL

 

