Eric Schrock's Blog

Month: August 2008

Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work; my contribution consisted only of extending the SES enumerator to support the new facility infrastructure.

You can find a detailed description of the changes in the original FMA portfolio here, but it’s much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
    label             string    Cooling Fan  0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f
    target-path       string    /dev/es/ses3

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x1 (LOCATE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x0 (SERVICE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x4 (FAN)
    units             uint32    0x12 (RPM)
    reading           double    3490.000000
    state             uint32    0x0 (0x00)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    discrete
    type              uint32    0x103 (GENERIC_STATE)
    state             uint32    0x1 (DEASSERTED)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

Here you can see the available indicators (locate and service), the fan speed (3490 RPM), and whether the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it’s not used by any software. But that will change shortly, as we work on the next phases:

  • Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it’s in the system chassis or an external enclosure.
  • Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.
  • Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are reported against a “vdev”, a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it’s possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.
  • Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.
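
Until that software exists, an administrator who wants to poke at the data has to read the fmtopo text directly. As a rough illustration only, here is a minimal Python sketch that parses output laid out like the listing above and pulls out the facility nodes (the ones whose FMRI ends in ?indicator=... or ?sensor=...). It is not part of the putbacks or of any Solaris tool: the names parse_fmtopo and facility_nodes are hypothetical, the sample is an excerpt of the output shown earlier, and the parser assumes nothing about the format beyond what that listing shows.

```python
# Sketch: parse fmtopo-style text into nodes, property groups, and properties.
# Assumes the layout shown in the listing above; not an official parser.

SAMPLE = """\
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x4 (FAN)
    units             uint32    0x12 (RPM)
    reading           double    3490.000000
    state             uint32    0x0 (0x00)
hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    discrete
    type              uint32    0x103 (GENERIC_STATE)
    state             uint32    0x1 (DEASSERTED)
"""


def parse_fmtopo(text):
    """Return {fmri: {group: {property: (type, value)}}} (hypothetical helper)."""
    nodes = {}
    node = group = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("hc://"):
            # A line that begins with an FMRI starts a new topology node.
            node = nodes.setdefault(line, {})
            group = None
        elif line.startswith("group:"):
            if node is not None:
                group = node.setdefault(line.split()[1], {})
        elif group is not None:
            # Property line: name, type, then an optional value (may be empty).
            parts = line.split(None, 2)
            if len(parts) >= 2:
                group[parts[0]] = (parts[1], parts[2] if len(parts) == 3 else "")
    return nodes


def facility_nodes(nodes):
    """Return (parent FMRI, kind, name, facility properties) for facility nodes."""
    found = []
    for fmri, groups in nodes.items():
        parent, sep, query = fmri.partition("?")
        if sep and "facility" in groups:
            kind, _, name = query.partition("=")
            found.append((parent, kind, name, groups["facility"]))
    return found


if __name__ == "__main__":
    for parent, kind, name, props in facility_nodes(parse_fmtopo(SAMPLE)):
        print(kind, name, props.get("reading", props.get("state")))
```

Because the parser strips leading whitespace and keys only on the hc:// and group: prefixes, it does not depend on how the output happens to be indented. Anything more serious than this kind of exploration belongs in a libtopo consumer rather than in a text scraper.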

Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.
