Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work; my contribution consisted only of extending the SES enumerator to support the new facility infrastructure.
You can find a detailed description of the changes in the original FMA portfolio here, but it’s much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
    label             string    Cooling Fan 0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f
    target-path       string    /dev/es/ses3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x1 (LOCATE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x0 (SERVICE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x4 (FAN)
    units             uint32    0x12 (RPM)
    reading           double    3490.000000
    state             uint32    0x0 (0x00)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    discrete
    type              uint32    0x103 (GENERIC_STATE)
    state             uint32    0x1 (DEASSERTED)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f
Here you can see the available indicators (locate and service), the fan speed (3490 RPM), and whether the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it’s not used by any software. But that will change shortly, as we work on the next phases:
- Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it’s in the system chassis or an external enclosure.
- Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.
- Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are reported against a “vdev”, a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it’s possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.
- Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.
Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.
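To give a feel for what a prtdiag-like summary might look like, here is a minimal Python sketch that scrapes captured fmtopo text and prints one line per indicator or sensor. It is only an illustration, not how a real tool would be built (that would presumably go through libtopo directly). The names `TopoNode`, `parse_fmtopo`, `facility_summary`, and `label` are hypothetical, and the parser assumes the text has one FMRI, group header, or property per line. The sample data is the fan node from above, trimmed to its facility groups.

```python
from dataclasses import dataclass, field

# FMRI of the fan node shown in the post; the sample keeps only "facility" groups.
FAN = ("hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005"
       ":server-id=/ses-enclosure=1/fan=0")

SAMPLE = f"""
{FAN}?indicator=ident
  group: facility version: 1 stability: Private/Private
    type uint32 0x1 (LOCATE)
    mode uint32 0x0 (OFF)
{FAN}?indicator=fail
  group: facility version: 1 stability: Private/Private
    type uint32 0x0 (SERVICE)
    mode uint32 0x0 (OFF)
{FAN}?sensor=speed
  group: facility version: 1 stability: Private/Private
    sensor-class string threshold
    type uint32 0x4 (FAN)
    units uint32 0x12 (RPM)
    reading double 3490.000000
    state uint32 0x0 (0x00)
{FAN}?sensor=fault
  group: facility version: 1 stability: Private/Private
    sensor-class string discrete
    type uint32 0x103 (GENERIC_STATE)
    state uint32 0x1 (DEASSERTED)
"""


@dataclass
class TopoNode:
    """One node from fmtopo-style text: an FMRI plus its property groups."""
    fmri: str
    groups: dict = field(default_factory=dict)  # group -> {prop: (type, value)}


def parse_fmtopo(text):
    """Parse fmtopo-style text (assumed: one item per line) into TopoNodes."""
    nodes, group = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("hc://"):          # a new node begins
            nodes.append(TopoNode(fmri=line))
            group = None
        elif line.startswith("group:") and nodes:
            group = nodes[-1].groups.setdefault(line.split()[1], {})
        elif group is not None:               # "name type [value]"
            parts = line.split(None, 2)
            if len(parts) >= 2:
                group[parts[0]] = (parts[1], parts[2] if len(parts) > 2 else "")
    return nodes


def label(value):
    """Return the text in parentheses from a value such as '0x12 (RPM)'."""
    start, end = value.find("("), value.rfind(")")
    return value[start + 1:end] if 0 <= start < end else value


def facility_summary(nodes):
    """Return (kind, name, detail) tuples for indicator and sensor nodes."""
    rows = []
    for node in nodes:
        fac = node.groups.get("facility")
        if fac is None or "?" not in node.fmri:
            continue
        kind, _, name = node.fmri.rsplit("?", 1)[1].partition("=")
        if kind == "indicator":
            detail = label(fac.get("mode", ("", "unknown"))[1])
        elif "reading" in fac:                # threshold sensor with a reading
            units = label(fac.get("units", ("", ""))[1])
            detail = f"{float(fac['reading'][1]):.0f} {units}".strip()
        else:                                 # discrete sensor: report its state
            detail = label(fac.get("state", ("", "unknown"))[1])
        rows.append((kind, name, detail))
    return rows


if __name__ == "__main__":
    for kind, name, detail in facility_summary(parse_fmtopo(SAMPLE)):
        print(f"{kind:<10}{name:<8}{detail}")
```

Scraping text like this is fragile, which is part of why a proper tool built on the sensor framework remains an open question; the sketch is only meant to show how little is needed to turn the facility nodes into something an administrator can read at a glance.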