If you are an intermediate or advanced user of ZFS, then this post is probably not for you.

Introduction

I only recently started looking seriously at ZFS. Until now, general skepticism about letting "new stuff" take care of my precious files has kept me from using it.

But no more. I am going to convert my home server to ZFS when I get home from The Camp. This post documents some of the decisions and solutions I arrived at while at The Camp.

My requirements for storage are:

  • Full disk encryption
  • Resilience for single drive failures
  • Rock solid filesystem

So far these have been met by gmirror + GBDE + ufs2. But the possibility of having my filesystems share a pool of free space has had me thinking about ZFS for some time. It would be nice not to have to move stuff around when one filesystem runs out of space.

I was a little worried about how disk failures would manifest themselves, and how they should be handled in order not to lose data. Thus I brought 3 USB disks to The Camp - one with known bad blocks.

Sven Esbjerg held a great ZFS workshop at The Camp, which I attended. It provided a nice crash course in ZFS. During the workshop I set up two disks as a mirror, and when the one with bad blocks started giving errors as data was added, I got to practice replacing it with the spare disk. All worked as expected.

While my requirements included full disk encryption, I also wanted raidz in order to gain more usable disk space.

Looking around I found this convoluted way of doing it, but I wanted to keep it (more) simple - even at the cost of not giving ZFS direct control of the disks. This blog post gave me a nice starting point.

In the end I opted for encrypting the devices with geli, and adding ZFS on top of that.

Configuring ZFS and testing failure

This was done on a laptop, using USB drives that show up as /dev/daN. On a server the devices would be something like /dev/adaN or /dev/adN.
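
If in doubt about which daN nodes belong to which physical disks, the attached drives can be listed first. A quick check could look something like this (either command will do):

camcontrol devlist      # List all CAM devices, including USB disks
geom disk list          # Show each disk with its size and description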

I will create 3 encrypted devices, tell ZFS to use them, and create a pool spanning them all.

# Create encrypted devices
geli init -s 4096 /dev/da0      # Using a blocksize of 4096 to be prepared
geli init -s 4096 /dev/da1      #   for future disks that use 4K blocks
geli init -s 4096 /dev/da2      # da2 is the bad disk
# Attach the devices for use
geli attach /dev/da0
geli attach /dev/da1
geli attach /dev/da2
# Create the zpool (called "tank"), spanning all 3 devices
zpool create tank raidz1 /dev/da0.eli /dev/da1.eli /dev/da2.eli
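
At this point it is worth checking that the pool came up as expected. Something along these lines shows the layout and the space available:

zpool status tank       # Should list raidz1-0 with the three .eli devices ONLINE
zpool list tank         # Raw pool size across all three devices
zfs list tank           # Usable space as seen by the filesystem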

I then started to fill up the pool in order to trigger errors from the bad disk.

for i in `jot 10` ; do dd if=/dev/zero bs=1m count=10240 of=/tank/zero1.$i; done
for i in `jot 10` ; do dd if=/dev/zero bs=1m count=10240 of=/tank/zero2.$i; done
# ... etc ...
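
To keep an eye on how full the pool was getting while the loops ran, something like this can be left running in another terminal:

# Print pool size, allocation and free space every minute
while true; do zpool list tank; sleep 60; done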

When the bad blocks were hit, errors started appearing in /var/log/messages, but nothing was registered by ZFS.

From /var/log/messages:
# ...
Jul 25 22:03:39 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): WRITE(10). CDB: 2a 0 c d0 64 20 0 0 80 0 
Jul 25 22:03:39 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
Jul 25 22:03:39 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): Retrying command
Jul 25 22:04:53 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): WRITE(10). CDB: 2a 0 c d0 64 20 0 0 80 0 
Jul 25 22:04:53 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
Jul 25 22:04:53 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): Retrying command
Jul 25 22:06:06 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): WRITE(10). CDB: 2a 0 c d0 64 20 0 0 80 0 
Jul 25 22:06:06 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
Jul 25 22:06:06 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): Retrying command
Jul 25 22:07:20 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): WRITE(10). CDB: 2a 0 c d0 64 20 0 0 80 0 
Jul 25 22:07:20 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
Jul 25 22:07:20 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): Retrying command
Jul 25 22:08:34 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): WRITE(10). CDB: 2a 0 c d0 64 20 0 0 80 0 
Jul 25 22:08:34 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
Jul 25 22:08:34 <kern.crit> guide kernel: (da2:umass-sim2:2:0:0): Error 5, Retries exhausted
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110071791616, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110071922688, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110072053760, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110072184832, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110072315904, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110072446976, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110072578048, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110072709120, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110071529472, length=131072)]
Jul 25 22:08:34 <kern.crit> guide kernel: GEOM_ELI: Crypto WRITE request failed (error=5). da2.eli[WRITE(offset=110071660544, length=131072)]
# ...

This kept going for a couple of hours (I kept adding data), while ZFS claimed that all was fine. Finally ZFS saw enough errors and dropped the bad disk (I forgot to save the output of 'zpool status' at this point).
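
If I were to repeat the test, I would watch both the kernel log and ZFS at the same time. A rough sketch of such monitoring (run in separate terminals) could be:

# Follow the kernel log for errors from the bad disk
tail -f /var/log/messages | grep -E 'da2|GEOM_ELI'
# Ask ZFS every 5 minutes whether any pool is unhealthy
while true; do zpool status -x; sleep 300; done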

I then detached the disk from the USB port, and ZFS showed:

guide ~ 90# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub in progress since Wed Jul 25 22:45:40 2012
        188G scanned out of 312G at 38.7M/s, 0h54m to go
        0 repaired, 60.17% done
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            da0.eli               ONLINE       0     0     0
            da1.eli               ONLINE       0     0     0
            17292134696089706765  REMOVED      0     0     0  was /dev/da2.eli

I have lost a disk, but raidz allows for that, and all data is still accessible.

Getting access to the pool after reboot

After reboot, the state is:

guide ~ 1# zpool status
  pool: tank
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      UNAVAIL      0     0     0
          raidz1-0                UNAVAIL      0     0     0
            7434867841503891175   UNAVAIL      0     0     0  was /dev/da0.eli
            2223752624539388409   UNAVAIL      0     0     0  was /dev/da1.eli
            17292134696089706765  UNAVAIL      0     0     0  was /dev/da2.eli

Without having attached the geli devices, ZFS cannot find its data, and the pool is offline.

Attach the devices for use:

guide ~ 3# geli attach /dev/da0
Enter passphrase:
guide ~ 4# geli attach /dev/da1
Enter passphrase:

ZFS now finds the devices, but does not mount the datasets.

guide ~ 5# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 1h58m with 0 errors on Thu Jul 26 00:43:43 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            da0.eli               ONLINE       0     0     0
            da1.eli               ONLINE       0     0     0
            17292134696089706765  REMOVED      0     0     0  was /dev/da2.eli
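
Running zfs mount with no arguments lists the ZFS filesystems that are currently mounted, which makes it easy to confirm that nothing from tank is available yet:

zfs mount               # With no arguments: list currently mounted ZFS filesystems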

Mount the datasets.

guide ~ 8# zfs mount -a

My data are ready to be used again.
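
Since this attach-then-mount dance is needed after every reboot, it is easy to wrap in a small script. A minimal sketch (the script name and device list are just my choices for this test setup):

#!/bin/sh
# unlock-tank.sh - attach the encrypted providers and mount the pool's datasets
for dev in da0 da1 da2; do
    geli attach /dev/${dev}     # Prompts for the passphrase of each device
done
zfs mount -a                    # Mount all datasets in the now complete pool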

If this were not just a test setup, I would get a new drive and add it to the pool.

geli init -s 4096 /dev/da2      # Assuming that da2 was the new drive
geli attach /dev/da2
zpool replace tank 17292134696089706765 /dev/da2.eli

ZFS would then resilver the pool. After a while the new drive would have all the needed data copied to it, and the pool would once again have enough redundancy to survive the loss of one drive.
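
The progress of the resilver can be followed with zpool status; the scan line reports how much has been copied and an estimate of the time remaining:

zpool status tank       # The "scan:" line shows resilver progress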

Keeping an eye on ZFS

In order to get reminders about the state of ZFS in the nightly emails, I have added ZFS status to /etc/periodic.conf. Furthermore, I have asked the system to run a scrub on the pool every 8 days.

guide ~ 22# cat /etc/periodic.conf 
daily_status_zfs_enable="YES"
daily_status_zfs_zpool_list_enable="YES"
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold=8
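
The periodic scripts behind these knobs can also be run by hand to see what the nightly mail will contain. On my system they live under /etc/periodic/daily (the exact script names may differ between FreeBSD versions):

/etc/periodic/daily/404.status-zfs      # Pool health summary for the daily mail
/etc/periodic/daily/800.scrub-zfs       # Scrubs pools that have passed the threshold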

Had this been my server, I would now have a good feeling about being informed about the state of my filesystems.

Further thoughts

It is my experience from these tests that ZFS does not like it when drives belonging to a pool disappear, not even when all filesystems in that pool have been unmounted. I would therefore not recommend using USB drives as removable storage with ZFS in production.

When replacing a failed drive in a raidz (or a mirror), the new drive must be the same size as or larger than the smallest drive the pool was created with. This should surprise no one.

In practice, large drives will differ slightly in size (unless they have the same part number). It is therefore prudent not to use the full drive, but to leave, say, 100MB-1GB unused at the end. That way the new 2TB drive - the one you managed to get after hours - is certain not to be 20 sectors too small.

With this in mind, my creation above would have been something like:

guide ~ 2> fgrep sectors: /var/log/messages
Jul 26 18:13:45 <kern.crit> guide kernel: da1: 953869MB (1953525167 512 byte sectors: 255H 63S/T 121601C)
# ...   # This time my phone was da0

guide ~ 1# gpart destroy -F da1
guide ~ 2# gpart create -s GPT da1
guide ~ 3# gpart add -b 2048 -s 1953000000 -t freebsd-zfs da1       # (1953525167-1953000000)/2/1024 = 256MB free
                                                                    # Start at block 2048 means ready for 4K drives
guide ~ 5# gpart show da1
=>          34  1953525100  da1  GPT  (931G)
            34        2014       - free -  (1M)
          2048  1953000000    1  freebsd-zfs  (931G)
    1953002048      523086       - free -  (255M)
# Repeat for da2 and da3

guide ~ 10# geli init -s 4096 /dev/da1p1                            # Note the use of "p1"
# Repeat for da2 and da3

guide ~ 15# geli attach /dev/da1p1
# Repeat for da2 and da3

guide ~ 20# zpool create tank raidz1 /dev/da1p1.eli /dev/da2p1.eli /dev/da3p1.eli

We now end up with roughly 0.5GB less usable space in tank: 256MB less on each of the 3 drives, of which one third would have gone to redundancy anyway.

Conclusion

I feel that this way of adding crypto to ZFS is the best option, at least until the day Oracle decides to open-source the changes from v28 to v33. It does encrypt the redundancy along with the data, but I see no real alternative if I want the extra benefits of ZFS.

Hi,

Thank you for documenting this experiment/observation and sharing it with others. I really appreciate it. I am a BSD noob trying to convert my system to BSD and very much interested in encrypted ZFS. I read somewhere that ZFS only supports up to 3 raidz parity disks, as I found out here -> https://blogs.oracle.com/ahl/entry/what_is_raid_z and so do most of the examples I found on the net.

Another question: are there any programs like TrueCrypt in BSD where I can hide partitions/(pools?), just like in Linux/Windows?

Thank you again for a thoughtful write-up on encrypted zfs.

Comment by hafidz Thu Aug 16 15:09:35 2012

(Sorry about the delay. I have not yet set up monitoring of the comment moderation queue. UPDATE: This has been fixed now)

I do not understand why you would want more than 3 parity disks. The ability to lose 3 disks before losing data is pretty OK. Are you sure that you are not confusing the number of disks in the pool with the number of parity disks?
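
For reference, the parity level is chosen when the pool is created: raidz1, raidz2 and raidz3 give one, two and three parity disks respectively. A sketch (pick one; the device names are just placeholders):

zpool create tank raidz1 da1.eli da2.eli da3.eli                  # 1 parity disk, survives 1 failure
zpool create tank raidz2 da1.eli da2.eli da3.eli da4.eli          # 2 parity disks, survives 2 failures
zpool create tank raidz3 da1.eli da2.eli da3.eli da4.eli da5.eli  # 3 parity disks, survives 3 failures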

As for truecrypt, you might be looking for /usr/ports/security/truecrypt

Comment by fj Thu Aug 30 20:41:17 2012

I've been considering converting my Windows Server 2008r2 setup with 6x3TB in RAID 5 to ZFS with encryption. So I'd be using FreeBSD 9.1 + GELI + a ZFS raidz1 or raidz2 pool.

What kind of performance can I expect? Can I easily saturate a 1Gbit connection? And how much CPU does it use? Is the performance CPU or HDD bound? And if so, which CPU do you use?

Hopefully you can give me some more insight.

Thanks

Comment by sukosevato Mon Feb 18 00:24:17 2013

Performance depends on your CPU. If it cannot encrypt/decrypt the data fast enough, you are going to wait for it.

My home server is an HP MicroServer N36L. It has an Atom-like dual-core CPU running at 1.3GHz - not a speed monster in any way. It has 4 drives attached, running raidz1.

I can read around 55MB/s from the encrypted ZFS. While doing this, the CPU is maxed out.

If your server has a modern high performance CPU, you will be able to saturate a 1Gb link with no problem.
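
Once a geli provider is set up, reading straight from it gives a rough idea of the single-disk decryption throughput on a given CPU (the device name is just an example from my setup):

dd if=/dev/da1p1.eli of=/dev/null bs=1m count=4096      # Read 4GB through geli and discard it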

Comment by fj Fri Feb 22 17:41:17 2013

For performance ideas, have a look at Dan Langille's results. He has looked at this much more than I have.

Comment by fj Mon Feb 25 10:21:17 2013