Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PSU & system health] Support PSU power threshold checking #1060

Merged
merged 23 commits into from Nov 18, 2022

Conversation

stephenxs
Copy link
Collaborator

@stephenxs stephenxs commented Aug 18, 2022

PRs

PR title state context
[PSU] Add warning/critical thresholds for PSU power GitHub issue/pull request detail GitHub pull request check contexts
[PSU daemon] Support PSU power threshold checking GitHub issue/pull request detail GitHub pull request check contexts
[psushow & psuutil] Support PSU power threshold checking GitHub issue/pull request detail GitHub pull request check contexts
[system health daemon] Support PSU power threshold checking GitHub issue/pull request detail GitHub pull request check contexts
[Mellanox] Support PSU power threshold checking GitHub issue/pull request detail GitHub pull request check contexts
[Mellanox] Support PSU power threshold checking GitHub issue/pull request detail GitHub pull request check contexts

Support PSU power threshold checking

  • Two platform APIs are introduced to represent the warning and critical thresholds of a PSU's power
  • In the main loop
    • Whenever a PSU becomes good, PSU daemon tries calling platform APIs to fetch both thresholds. If None is returned or NotImplemented is thrown by either API, the PSU power threshold checking will not be performed for the PSU
    • the PSU daemon compares the power with thresholds, exposing status to the database, and logging messages.

Signed-off-by: Stephen Sun stephens@nvidia.com

Signed-off-by: Stephen Sun <stephens@nvidia.com>
doc/psud/PSU_daemon_design.md Outdated Show resolved Hide resolved
doc/psud/PSU_daemon_design.md Outdated Show resolved Hide resolved
doc/psud/PSU_daemon_design.md Show resolved Hide resolved
doc/psud/PSU_daemon_design.md Outdated Show resolved Hide resolved
doc/psud/PSU_daemon_design.md Show resolved Hide resolved
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs changed the title [PSU] Support PSU power threshold checking [PSU & system health] Support PSU power threshold checking Aug 29, 2022
liat-grozovik
liat-grozovik previously approved these changes Aug 29, 2022
@liat-grozovik
Copy link
Collaborator

@prgeor kinldy reminder to review and provide your feedback

@stephenxs
Copy link
Collaborator Author

Conflicts resolved.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@zhangyanzhao
Copy link
Collaborator

@stephenxs please update the https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md with the new command line.

@prgeor
Copy link
Contributor

prgeor commented Nov 10, 2022

@shyam77git can you review from chassis point of view?

@zhangyanzhao
Copy link
Collaborator

@prgeor the PR is approved and build passed, are you ok to merge? Thanks.

Copy link
Contributor

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephenxs as discussed in the community can we have an agreement that the "warning threshold" is confusing because no warning is raised at "warning" threshold? How about renaming it to "warning-suppress" level?

- `<power>` is the current power of the PSU
- `<threshold>` is the warning threshold of the PSU
- Otherwise: no action

## 3. DB schema for PSU

PSU number is stored in chassis table. Please refer to this [document](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-enhancement-design.md), section 1.5.2.

PSU information is stored in PSU table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this table for all PSUs in the system?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Comment on lines 140 to 142
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold
power_warning_threshold = 1*4.3DIGIT ; The power warning threshold
power_critical_threshold = 1*4.3DIGIT ; The power critical threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these thresholds the same for all PSUs?

Copy link
Collaborator Author

@stephenxs stephenxs Nov 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For us, yes. But it's possible for other vendors to have different thresholds among all PSUs.

- `Not OK` which can be caused by:
- power is not good, which means the PSU is present but no power (Eg. the power is down or power cable is unplugged)
- `WARNING` which can be caused by:
- power exceeds the PSU's power threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which threshold? critical or warning?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The critical threshold. Will fix it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated


def get_psu_power_warning_threshold(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a set API for user to override?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not suggest having an API to override the thresholds. As discussed before, it should be controlled by the platform vendor only because users do not have full knowledge to decide what the threshold should be. For example, in our system, the thresholds depend on temperatures as well.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
voltage_max_th = INT ; max threshold of the output voltage
voltage_min_th = INT ; min threshold of the output voltage
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold
power_warning_threshold = 1*4.3DIGIT ; The power warning threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming not changed to supress

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -275,6 +278,7 @@ When something is wrong
orchagent is not running
Hardware Fault
PSU 1 temp 85C and threshold is 70C
PSU 1 power (66.32w) exceeds thresholds (warning: 60.00w critical: 70.00w)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'thresholds' which threshold?

Copy link
Collaborator Author

@stephenxs stephenxs Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's hard to say which threshold is crossed because most times it exceeds the critical threshold but sometimes it can be in (warning_suppress_threshold, critical_threshold) because of hysteresis.
I think we can just put the value of warning_suppress_threshold here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

max_power = 1*4.3DIGIT ; power capacity of the psu
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold
power_warning_suppress_threshold = 1*4.3DIGIT ; The power warning threshold
power_critical_threshold = 1*4.3DIGIT ; The power critical threshold

Now psud only collect and update "presence" and "status" field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need one field in the DB which indicates the critical threshold warning is ACTIVE. this is needed for telemetry

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

power_overload is the field to indicate whether it’s in the warning state

Copy link
Contributor

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephenxs its not clear how is the user notified? if syslog how many times syslog is raised?

stephenxs and others added 2 commits November 17, 2022 23:58
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Fix comments
@stephenxs
Copy link
Collaborator Author

@stephenxs its not clear how is the user notified? if syslog how many times syslog is raised?

syslog. it is printed only once when the alarm is raised and cleared.
PS. it's not relevant to the feature itself but is how it works in PSU daemon.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
4. PSU's power was in range (warning threshold, critical threshold) and is greater than the critical threshold
1. if warning was raised, no action expected
2. if warning was not raised, a warning should be raised
5. PSU's power was less than the warning threshold and is greater than the critical threshold: a warning should be raised
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its confusing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to test the case that the PSU power jumps from a value below the warning suppress threshold to a value above critical threshold.

voltage = INT ; output voltage of the PSU
voltage_max_th = INT ; max threshold of the output voltage
voltage_min_th = INT ; min threshold of the output voltage
power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me check if this should be INT for telemetry

Signed-off-by: Stephen Sun <stephens@nvidia.com>
We use asymmetric thresholds between raising and clearing the alarm for the purpose of creating a hysteresis and avoiding alarm flapping.

- an alarm will be raised when a PSU's power is rising accross the critical threshold
- an alarm will be cleared when a PSU's power is dropping across the warning threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning threshold -> 'warning suppress' threshold


1. Retrieve the current power
2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning threshold
- If `current power` < `warning threshold`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning-suppress threshold

2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning threshold
- If `current power` < `warning threshold`
- Set `PSU power exceeded threshold` to `false`
- Message in NOTICE level should be logged: `PSU <x>: current power <power> is below the warning threshold <threshold>` where
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning-suppress threshold

Comment on lines 190 to 195
`psuutil` fetches the information via calling platform API directly. Both warning and critical thresholds will be exposed in the output of psuutil status.
The "WARNING" state is not exposed because psuutil is a one-time command instead of a daemon, which means it does not store state information. It fetches information via calling platform API so it can not distinguish the following status:

1. The power exceeded the critical threshold but is in the range between the warning and critical thresholds, which means the alarm should be raised
2. The power didn't exceed the critical threshold and exceeds the warning threshold, which means the alarm should not be raised

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning -> 'warning-suppress'

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@prgeor prgeor merged commit bf6fa10 into sonic-net:master Nov 18, 2022
@prgeor prgeor added the PSU label Nov 18, 2022
@prgeor prgeor self-assigned this Nov 18, 2022
@stephenxs stephenxs deleted the psu-power-threshold-github branch November 18, 2022 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants