We recently had an issue with Crucial M4 solid state disks when using them as ZFS cache devices on Solaris 11 Express (snv_151a). The disks were logging a whole bunch of write errors and had been "FAULTED" by ZFS. To make the problem even worse, when we tried to zpool clear them it locked up my SSH session, as well as subsequent sessions: it would let me initiate a new session (and authenticate), but would never deliver me to the prompt. Additionally, when I went onto the console, running a "zfs list", "zpool list", or "zpool status" would lock up the shell every time. When I say lock up, I mean lock up: no CTRL + C, no CTRL + Z, nothing. Since these devices were cache devices only, this wasn't the end of the world for us.
To get past the locked-up shells, I simply removed the disks physically from the server, which got my prompts back (I don't recall whether I had to CTRL + C, but I don't think so). I was then able to zpool clear the devices to remove the FAULTED status and zpool offline them. It is also important to note that in our environment I am not aware of any "downtime" we suffered from this; only the shells were affected. Our CIFS was still serving, as was our NFS. We aren't hosting any Fibre Channel on this box yet, but I suspect that would not have been affected either.
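For reference, the recovery steps above amount to just a couple of zpool commands. This is a sketch using the pool and cache-device names from the zpool status output on this particular system; substitute your own pool and device names.

```shell
# Clear the error counters / FAULTED state on the two cache devices
# ("tank" and the device names below are from this system).
zpool clear tank c0t500A0751030437A3d0
zpool clear tank c0t500A075103043823d0

# Take the cache devices offline so ZFS stops trying to use them.
zpool offline tank c0t500A0751030437A3d0
zpool offline tank c0t500A075103043823d0

# Confirm the pool state afterwards.
zpool status tank
```

Since these are L2ARC cache devices, another option is to detach them from the pool entirely with "zpool remove tank <device>" before pulling them for the firmware update, then add them back afterwards with "zpool add tank cache <device>".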
# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 19.6G in 0h6m with 0 errors on Wed Apr 20 17:07:19 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        rpool                        ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            c0t5000C500339B1447d0s0  ONLINE       0     0     0
            c0t5000C500339AFB37d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed May 11 15:20:40 2011
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c0t5000C50033D5E9AFd0  ONLINE       0     0     0
            c0t5000C50033D60E4Bd0  ONLINE       0     0     0
            c0t5000C50033D62FC3d0  ONLINE       0     0     0
            c0t5000C500260B365Bd0  ONLINE       0     0     0
            c0t5000C500260FB5BBd0  ONLINE       0     0     0
          raidz1-1                 ONLINE       0     0     0
            c0t5000C500262A1883d0  ONLINE       0     0     0
            c0t5000C5002627A2BBd0  ONLINE       0     0     0
            c0t5000C50026016D3Fd0  ONLINE       0     0     0
            c0t5000C50026282A07d0  ONLINE       0     0     0
            c0t5000C50026132717d0  ONLINE       0     0     0
        cache
          c0t500A0751030437A3d0    FAULTED      1    65     0  too many errors
          c0t500A075103043823d0    FAULTED      0   124     0  too many errors

errors: No known data errors
However, this did not solve the underlying problem. As of right now it appears the problem was caused by a firmware bug: the drives we had came with "Firmware Rev: 0001", while the latest version is 0002. So if we inspect the changelog…
Release Date: 06/8/2011
- Added margin to already-passing electromagnetic interference regulatory tests. Provides additional EMI margin for systems integrators.
- Improved performance with Link Power Management. Resolves performance pauses and hesitations with certain host systems.
- This is a recommended but not required firmware update. If the end user is experiencing pauses or hesitations in systems with Link Power Management (“LPM”) enabled, then this update is highly recommended.
Now here is what we were looking for. The "pauses or hesitations" quoted above are the problem: those pauses or hesitations are what ZFS records as read/write failures. This is exactly the same reason why hardware RAID + ZFS is a bad combination. So a firmware update was in order. The great news about this firmware is that it comes with a really small Linux LiveCD which detects the disks and performs the update really quickly. Read the guide that Crucial provides, because some BIOS changes need to be made on the system where you flash the firmware onto the drives (not your ZFS system); I used a spare desktop.
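Once the flashed drives are back in the Solaris box, you can check which firmware revision the OS now sees without pulling a sticker off the drive. A sketch, assuming Solaris's iostat error/device report includes a "Revision" field for these devices (device names are the cache devices from this system):

```shell
# Print extended device statistics, including vendor, product,
# and firmware revision, for the two M4 cache devices.
iostat -En c0t500A0751030437A3d0 c0t500A075103043823d0 | grep -i revision
```

After a successful update you would expect the revision field to read 0002 rather than 0001.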
Also, save yourself some heartache in the future: since Crucial was nice enough to put the firmware revision on the sticker, it would be wise to update the sticker on yours after flashing (I used our label maker). That way, if you ever have to go to rev 3 or 16 or whatever, you know where you are in case there is a requirement to upgrade through certain intermediate versions.
The most ironic thing about the whole situation is that we had gotten support for this box just a few days before this happened; we didn't need to use it this time.
If you have also experienced this or a similar issue please leave a comment and tell me about it.
UPDATE – JANUARY 18, 2012
The Crucial M4 SSDs have another more critical firmware bug. Read more here.