Platform telemetry and health monitoring Work Group

ladkani edited this page May 2, 2022 · 19 revisions

Work-Group Goals

Telemetry

  • How can we enable run-time telemetry collection?
  • How can we enable a third party HW/SW subsystem to create and harvest telemetry?
  • Enable telemetry across platform components: the host subsystem (host processors, memory, IO, GPU, and custom silicon) and the manageability subsystem (BMC CPU, BMC DRAM, BMC IOs).

Health Monitoring:

  • How can we determine that a host/hypervisor is unresponsive due to HW/SW issues?
  • How can we collect information to debug unresponsive hosts/hypervisors (in-band and out-of-band)?

Workstream Lead: Neeraj Ladkani (Microsoft)

Meeting time (bi-weekly):

Alternate Tuesdays: 9:30 AM PST

Microsoft Teams meeting:

https://teams.microsoft.com/l/meetup-join/19%3ameeting_NjQ4OTFmNzYtNTlhOS00N2Q1LTlkYzUtNDBjYWQ0OGQ2OGJi%40thread.v2/0?context=%7b%22Tid%22%3a%2272f988bf-86f1-41af-91ab-2d7cd011db47%22%2c%22Oid%22%3a%221c29c2f7-d386-4c5a-a3bc-0ee13b48bc65%22%7d

+1 323-849-4874 United States, Los Angeles (Toll) (866) 679-9995 (Toll-free) Conference ID: 636 049 49#

Join with a video conferencing device 813878896@t.plcm.vc VTC Conference ID: 017988591

Requirement Gathering Sheet: https://docs.google.com/spreadsheets/d/12gMMXB9r_WfWDf5wz-Z_zXsz6RNheC6p2LKp7HePAEE/edit?usp=sharing


Minutes of meeting: 12/10/19 @ 9:30 AM PST

Areas discussed:

Change the bi-weekly meeting time to 12:30 PM PST.

The workgroup will propose schema changes to include an OEM metric definition (or create a new schema).

Dynamic configuration of D-Bus sensors for telemetry collection: already in design.

Pitor will update the timeline for implementation; Intel will contribute changes for the Redfish service.

Minutes of meeting: 08/06/19 @ 10:00 PM IST

Attendees: Neeraj (Microsoft), Vijay (Facebook), Paul Vancill (Dell), Kun Yi (Google), Naidu et al. (Intel), Vishwa (IBM)

Two more people attended, but I cannot recall their names.

Areas discussed:

Redfish telemetry: The team spent some time discussing the Redfish telemetry proposal that Paul has put up for review.

Paul will prepare and upload patch set 2. Overall, the team agrees that the Redfish telemetry model would suit the needs. Paul also mentioned that he would share some pointers on the telemetry mockup.

Intel mentioned that they already have a working telemetry implementation and are looking to upstream it. Intel will share details of their current implementation on the mailing list.

On data transfer, we agree that support for HTTP push/pull and SSE is needed. Vishwa mentioned that Ratan (IBM) is working along those lines. Vishwa has added Ratan to the Redfish telemetry proposal on Gerrit.

In-band Redfish: Paul Vancill mentioned an implementation that uses USB Ethernet between the host and the BMC, which allows Redfish to be used for in-band communication as well. However, the team understands that this requires a supporting stack and is beyond the current workstream.

Metric collection: The team agrees that having collectd store metrics as round-robin (RR) files is the right thing to do. (Note: librrd provides a good number of APIs to handle the data and files.)

Kun has already brought collectd into OpenBMC and will push it as soon as the repo is created.

Vishwa and Kun touched on the "IO hell" that is attributed to RR files. Kun mentioned that as long as the number of data points does not grow too large, we don't have an issue. There are also ways to work around IO hell.
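To illustrate why round-robin storage keeps the data volume bounded, here is a minimal in-memory sketch. It is not collectd or librrd code and does not reproduce the RRD file format; the class name, slot count, and consolidation by averaging are assumptions chosen for illustration.

```python
from collections import deque


class RoundRobinArchive:
    """Hypothetical fixed-size archive: old slots are overwritten by new ones."""

    def __init__(self, slots, samples_per_slot):
        # deque(maxlen=...) silently drops the oldest slot once it is full.
        self.slots = deque(maxlen=slots)
        self.samples_per_slot = samples_per_slot
        self._pending = []

    def update(self, value):
        """Buffer a raw sample; consolidate into one slot when enough arrive."""
        self._pending.append(value)
        if len(self._pending) == self.samples_per_slot:
            self.slots.append(sum(self._pending) / len(self._pending))
            self._pending.clear()


archive = RoundRobinArchive(slots=3, samples_per_slot=2)
for reading in [10, 20, 30, 50, 40, 60, 70, 90]:
    archive.update(reading)

# Four averages were produced, but only the newest three are retained.
print(list(archive.slots))  # [40.0, 50.0, 80.0]
```

The archive never grows beyond its configured number of slots, regardless of how long collection runs. How often slots are flushed to flash is a separate question that this sketch does not address.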

We need to see how to make RR data available to users. For example, how would the Redfish telemetry framework use it?

In-band metric transfer: Kun's current proposal uses protobuf as the means to transfer metrics in-band. Vishwa indicated a need to achieve this using the PLDM framework. Vishwa will provide links for PLDM and the IBM implementation to the team.

Vijay suggested that we leave it to the implementation to choose what is needed for its environment. The default would be IPMI OEM. If PLDM is desired, it needs to be plugged in accordingly.

Please help correct or add to these notes.

[Vishwa]


MOM (6/24/19)

Paul (Dell) proposed Redfish telemetry: https://www.dmtf.org/documents/redfish-spmf/redfish-telemetry-white-paper-010a

  • Can we leverage the existing Redfish spec for telemetry collection?
  • The current Redfish telemetry model defines triggers and trigger actions.
  • Software running outside the BMC can send a Redfish request to the BMC to start collecting system telemetry and request the data in the same format.
  • Paul can share a Redfish mock-up for a telemetry report.
  • Binary blobs: should they be used, or can we use Redfish telemetry collection?
  • Once data is generated, we can define metric report generation.
  • The same scheme can be used to push and pull from the BMC.
  • Can we propose a "BMC subsystem metric" such as BMC CPU, memory, and IO as a metric?
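As a rough sketch of what a request to start collecting telemetry could look like, the snippet below builds a periodic metric report definition as a JSON body. Property names follow the published DMTF MetricReportDefinition schema, which may differ from the draft white paper linked above. The report name and the two metric property URIs are hypothetical placeholders, not existing OpenBMC resources.

```python
import json


def build_metric_report_definition(report_id, metric_uris, interval="PT10S"):
    """Build a periodic MetricReportDefinition body as a plain dict."""
    return {
        "Id": report_id,
        "Name": f"{report_id} report definition",
        "MetricReportDefinitionType": "Periodic",
        # ISO 8601 duration: generate a report every 10 seconds by default.
        "Schedule": {"RecurrenceInterval": interval},
        # Keep reports for later pull, and also push them as Redfish events.
        "ReportActions": ["LogToMetricReportsCollection", "RedfishEvent"],
        "ReportUpdates": "Overwrite",
        "MetricProperties": list(metric_uris),
    }


definition = build_metric_report_definition(
    "BmcSubsystem",  # hypothetical report name
    [
        # Hypothetical URIs standing in for BMC CPU and memory metrics.
        "/redfish/v1/Managers/bmc/Oem/Example/CpuUtilization",
        "/redfish/v1/Managers/bmc/Oem/Example/MemoryUtilization",
    ],
)
body = json.dumps(definition, indent=2)
print(body)
```

A client would typically POST such a body to the MetricReportDefinitions collection under the Redfish TelemetryService. Listing both report actions reflects the note above that the same scheme can serve push and pull.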

Meeting time: Tuesdays, 9:30 AM PST and 9:00 PM PST (rotating bi-weekly)


Notes:

(Joseph Reynolds 2019-06-12): I've added this to the Security Working Group agenda. And I've decorated that agenda item with my notes, including:

  • I understand this is an application for https://en.wikipedia.org/wiki/Telemetry where the BMC collects data and sends it to an external telemetry server on the BMC’s management network. For example, this could be done via collectd (https://collectd.org/) and use the cryptographic extensions of the collectd network plugin.
  • Apply the CIA triad (https://en.wikipedia.org/wiki/CIA_triad) to the telemetry data stream:
      • How important is it to authenticate that the telemetry data received came from the correct BMC? For example, what if someone provided false data to the telemetry server or tampered with it (changed it) as it went across the network? Or what if the telemetry streams of two different BMCs do not stay separate?
      • How important is it that the telemetry data arrive in a timely manner? For example, what if it never arrives?
      • How important is it that the telemetry data remain confidential? For example, what if someone were to read it?
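One way to address the authentication and tampering questions is to attach a keyed hash (HMAC) to each report. The sketch below is an assumption about how that could look, not how collectd's network plugin frames its packets. The function names, the message layout, and the pre-shared key are hypothetical.

```python
import hashlib
import hmac
import json


def sign_report(key: bytes, bmc_id: str, metrics: dict) -> dict:
    """Serialize a report and attach an HMAC-SHA256 tag over the payload."""
    # Including bmc_id in the signed payload keeps streams from different
    # BMCs distinguishable on the server side.
    payload = json.dumps({"bmc_id": bmc_id, "metrics": metrics}, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "hmac": tag}


def verify_report(key: bytes, message: dict) -> bool:
    """Recompute the tag on the server and compare in constant time."""
    expected = hmac.new(key, message["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["hmac"])


key = b"example-pre-shared-key"  # hypothetical; real keys need provisioning
message = sign_report(key, "bmc-01", {"cpu_percent": 12.5})
print(verify_report(key, message))  # True

# A payload modified in transit no longer matches its tag.
tampered = dict(message, payload=message["payload"].replace("12.5", "99.9"))
print(verify_report(key, tampered))  # False
```

This covers origin and integrity only. Confidentiality would additionally need encryption, and timely arrival needs to be monitored on the receiving side.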

MOM (6/10/19)

Round Table - Introduction and high-level expectations:

Neeraj Ladkani, Sagar Dharia (Microsoft): Create a generic framework that supports BMC FW and near real-time platform health. Support scenarios to debug unresponsive BMCs and interfaces (black box). It should help debug unresponsive hosts (OS hangs, CPU IERRs, uncorrectable memory errors, PCIe errors).

Kun Yi (Google): A service that can run on the BMC as an endpoint and provide real-time health reports to large automation systems.

Vijay Khemka (Facebook): A health monitoring daemon that can read BMC health and post it outside the BMC. It should be highly configurable so that it suits most use cases and scenarios.

Srinivas (IBM): Platform health monitoring.

Vishwa (IBM): There was interest in the group in handling crash dumps as part of this framework as well; Vishwa mentioned that this could sit outside the framework and be plugged into the sas-report and ABRT standard crash reporting mechanisms. Interested in hardware events like I2C recovery, FSI, spurious interrupts, etc. Also interested in streaming the metrics given by the OCC to external clients, and in a health monitoring daemon that can watch flash wear and tear in addition to doing what others have already mentioned.

Kisan (Cisco): Would like event-driven telemetry that can help monitor internal platform events and expose them over Redfish.

Sivas (IBM): How can we add these requirements to the OpenBMC test framework?

Notes:

• Meeting time: Votes for PST time zone, next meeting time: Tuesday ( 06/25) 9:30 AM PST 
• Need to create a wiki page for this track. 
• Four sub-tracks 

	a. What to capture:
		i. BMC subsystem: CPU usage, memory usage, storage, Linux subsystem
		ii. Host subsystem: host CPU, Memory, GPU, IO, FPGA, Network, Sensors, OS subsystem, Thermal, Hardware subsystem
		
	b. How to capture: 
		i. collectd sounds promising so far and supports third-party plugins.
			□ Kun and Neeraj to provide an update at the next meeting on the tradeoffs of using collectd.
		
	c. Where to capture:
		i. Information can be kept in volatile memory (RAM).
		ii. Critical information should be kept in persistent storage (SPI, EEPROM, SD, eMMC).
		
	d. How to retrieve:
		i. IPMI over LAN is not preferred due to security reasons.
		ii. Plain text vs. binary?
		iii. Redfish is preferred.
			□ Vishwa to provide an update at the next meeting on whether we can support exporting raw blobs over Redfish.
			
• Configurability is very critical in solution so that it can be tailored for most needs. 
• Should support OEM ways to do things so that it can be extended if required. 
• Space requirements should be considered in the design for certain implementations. 
• Should be aligned with security requirements.
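As an illustration of the "what to capture" sub-track for the BMC subsystem, the sketch below derives a memory usage percentage from text in the Linux /proc/meminfo format. The function names are hypothetical, the sample values are made up, and in practice a collectd plugin could supply the same metric.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            # The first field is the number; a trailing "kB" unit is ignored.
            values[name.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """Share of memory that is not available for new allocations."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


# Made-up sample; on a BMC this would come from open("/proc/meminfo").read().
SAMPLE = """MemTotal:         500000 kB
MemFree:          100000 kB
MemAvailable:     200000 kB
"""

print(memory_used_percent(parse_meminfo(SAMPLE)))  # 60.0
```

Parsing from a string rather than reading the file directly keeps the sketch testable off-target.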

Feel free to add if I missed capturing any specific details.

Neeraj