Rook and Ceph upgrades are designed to keep data available even while an upgrade is in progress. Rook performs upgrades in a rolling fashion so that application pods are not disrupted. For the upgrades to be seamless, it is important to begin them with Ceph in a fully healthy state. This guide reviews ways of verifying the health of a CephCluster.
See the troubleshooting documentation for help with any issues that arise during upgrades.
Pods all Running
In a healthy Rook cluster, all pods in the Rook namespace should be in the Running (or Completed) state and have few, if any, pod restarts.
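This check can be scripted. The following is a minimal sketch, not part of Rook. It reads the JSON produced by `kubectl -n rook-ceph get pods -o json` and lists pods that are neither Running nor Completed, or that have restarted more than a threshold. The `rook-ceph` namespace and the threshold of 3 are assumptions. kubectl displays Completed for pods whose phase is Succeeded, so that phase is treated as healthy.

```python
"""Flag Rook pods that are not Running or Completed, or that restart often.

Minimal sketch, not part of Rook. Input is the JSON written by
`kubectl -n rook-ceph get pods -o json` (the namespace is an assumption).
"""
import json
import sys

# kubectl shows "Completed" for pods whose phase is "Succeeded".
HEALTHY_PHASES = {"Running", "Succeeded"}


def find_unhealthy_pods(pod_list, max_restarts=3):
    """Return (name, phase, restarts) for each pod that looks unhealthy.

    max_restarts is a hypothetical threshold for "few, if any, restarts".
    """
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # Add up restarts across every container in the pod.
        restarts = sum(
            cs.get("restartCount", 0) for cs in status.get("containerStatuses", [])
        )
        if phase not in HEALTHY_PHASES or restarts > max_restarts:
            problems.append((name, phase, restarts))
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        for name, phase, restarts in find_unhealthy_pods(json.load(f)):
            print(f"{name}\tphase={phase}\trestarts={restarts}")
```

To use it:

1. Save the pod list with `kubectl -n rook-ceph get pods -o json > pods.json`.
2. Run `python3 check_pods.py pods.json` (the script name is hypothetical).

No output means no pod was flagged. A pod in CrashLoopBackOff still reports the Running phase, which is why the restart count is checked as well.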
Status Output
The Rook toolbox contains the Ceph tools that give status details of the cluster with the ceph status command.
In the ceph status output, note the following indications that the cluster is in a healthy state:
- Cluster health: The overall cluster status is HEALTH_OK and there are no warning or error status messages displayed.
- Monitors (mon): All of the monitors are included in the quorum list.
- Manager (mgr): The Ceph manager is in the active state.
- OSDs (osd): All OSDs are up and in.
- Placement groups (pgs): All PGs are in the active+clean state.
- (If applicable) Ceph filesystem metadata server (mds): all MDSes are active for all filesystems
- (If applicable) Ceph object store RADOS gateways (rgw): all daemons are active
If the ceph status output deviates from the general good health described above, there may be an issue that needs further investigation. Other commands, such as ceph osd status, may show more relevant details on the health of the system. See the Ceph troubleshooting docs for help.
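The main healthy-state indications can also be checked against the JSON form of the status. This minimal sketch, which is not a Rook or Ceph tool, checks for:

- HEALTH_OK
- all monitors in quorum
- an available manager
- all OSDs up and in
- all PGs active+clean

The JSON field names it reads are assumptions based on recent Ceph releases. Older releases nest some fields, such as the OSD map, differently. The optional MDS and RGW checks are left out.

```python
"""Compare `ceph status --format json` output with the healthy-state list.

Minimal sketch. The field names (health.status, health.checks, quorum_names,
monmap.num_mons, mgrmap.available, osdmap.num_osds/num_up_osds/num_in_osds,
pgmap.pgs_by_state) are assumed from recent Ceph releases; verify them
against the output of your own cluster.
"""
import json
import sys


def health_findings(status):
    """Return human-readable reasons why the cluster does not look healthy."""
    findings = []

    health = status.get("health", {})
    if health.get("status") != "HEALTH_OK":
        findings.append(f"health is {health.get('status')}")
        # Each key in "checks" (for example OSD_DOWN) names one warning or error.
        findings.extend(f"check {name}" for name in health.get("checks", {}))

    num_mons = status.get("monmap", {}).get("num_mons")
    quorum = status.get("quorum_names", [])
    if num_mons is not None and len(quorum) != num_mons:
        findings.append(f"{len(quorum)}/{num_mons} mons in quorum")

    if not status.get("mgrmap", {}).get("available", False):
        findings.append("no active mgr")

    osdmap = status.get("osdmap", {})
    total = osdmap.get("num_osds", 0)
    for key in ("num_up_osds", "num_in_osds"):
        if osdmap.get(key, 0) != total:
            findings.append(f"{key}={osdmap.get(key, 0)} of {total} OSDs")

    for pg in status.get("pgmap", {}).get("pgs_by_state", []):
        if pg["state_name"] != "active+clean":
            findings.append(f"{pg['count']} PGs {pg['state_name']}")

    return findings


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        findings = health_findings(json.load(f))
    print("\n".join(findings) if findings else "no deviations found")
```

To use it:

1. Save the status with `kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status --format json > status.json`.
2. Pass that file to the script.

The toolbox deployment name `rook-ceph-tools` and the namespace are assumptions; adjust them to your cluster.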
Upgrading an unhealthy cluster
Rook will not upgrade Ceph daemons if the health is in a HEALTH_ERR state. Rook can be configured to proceed with the (potentially unsafe) upgrade by setting either skipUpgradeChecks: true or continueUpgradeAfterChecksEvenIfNotHealthy: true as described in the cluster CR settings.
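Both settings are fields of the CephCluster spec, so one way to apply them is a JSON merge patch on the CephCluster resource. This minimal sketch only builds and prints the kubectl command, so the change can be reviewed before it is run. The CephCluster name and namespace (both `rook-ceph`) are assumptions.

```python
"""Build a JSON merge patch that sets one of the upgrade-check overrides.

Minimal sketch. The CephCluster name and namespace (rook-ceph) used in the
printed command are assumptions.
"""
import json

UPGRADE_OVERRIDES = ("skipUpgradeChecks", "continueUpgradeAfterChecksEvenIfNotHealthy")


def upgrade_override_patch(setting, enabled=True):
    """Return the merge patch body for one override as a JSON string."""
    if setting not in UPGRADE_OVERRIDES:
        raise ValueError(f"unknown setting: {setting}")
    return json.dumps({"spec": {setting: enabled}})


if __name__ == "__main__":
    patch = upgrade_override_patch("continueUpgradeAfterChecksEvenIfNotHealthy")
    # Printed rather than executed so the potentially unsafe change is reviewed first.
    print(f"kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{patch}'")
```

Printing instead of executing keeps the risky step explicit. Calling the function with `enabled=False` produces the patch that turns the override off again once the upgrade has finished.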
Container Versions
The container version running in a specific pod in the Rook cluster can be verified in its pod spec output. For example, for the monitor pod mon-b, the container image it is running is listed in the image field of its pod spec.
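The same lookup can be done on the pod list JSON. This minimal sketch has the following assumptions:

- The namespace is `rook-ceph`.
- The mon-b pod name starts with `rook-ceph-mon-b-`. The trailing dash keeps the prefix from also matching a longer mon name such as `rook-ceph-mon-ba`, if one exists.
- The image of interest is that of the first container.

```python
"""Print the container image of the first pod whose name has a given prefix.

Minimal sketch. Input is `kubectl -n rook-ceph get pods -o json`; the
namespace and the rook-ceph-mon-b- prefix are assumptions.
"""
import json
import sys


def image_for_pod(pod_list, name_prefix):
    """Return the first container's image for the first matching pod, else None."""
    for pod in pod_list.get("items", []):
        if pod["metadata"]["name"].startswith(name_prefix):
            return pod["spec"]["containers"][0]["image"]
    return None


if __name__ == "__main__" and len(sys.argv) > 2:
    with open(sys.argv[1]) as f:
        print(image_for_pod(json.load(f), sys.argv[2]))
```

For example, run `python3 pod_image.py pods.json rook-ceph-mon-b-`. The script name is hypothetical.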
The status and container versions for all Rook pods can be collected at once by listing the pods in the Rook namespaces and printing each pod's phase and container images.
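As a minimal sketch of that summary, the function below turns one `kubectl get pods -o json` dump into a row per pod. Each row holds the pod name, its phase, the first container's image, and the first init container's image. Run it on one dump per namespace, for example the operator and the cluster namespaces. Using only the first container and first init container is a simplification.

```python
"""Summarize phase and images for every pod in `kubectl get pods -o json` dumps.

Minimal sketch; pass one JSON file per Rook namespace (operator and cluster).
"""
import json
import sys


def pod_versions(pod_list):
    """Return (name, phase, first container image, first init container image)."""
    rows = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        containers = spec.get("containers", [])
        inits = spec.get("initContainers", [])
        rows.append((
            pod["metadata"]["name"],
            pod.get("status", {}).get("phase", "Unknown"),
            containers[0]["image"] if containers else "",
            inits[0]["image"] if inits else "",  # empty when there is no init container
        ))
    return rows


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            for row in pod_versions(json.load(f)):
                print("\t".join(row))
```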
The rook-version label exists on Ceph resources. A summary of the resource controllers, such as deployments, can be obtained by listing them in the Rook namespace. Such a summary reports the requested, updated, and currently available replicas for each Rook resource, along with the version of Rook for resources managed by Rook. Note that the operator and toolbox deployments do not have a rook-version label set.
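For deployments, the summary can be sketched from `kubectl -n rook-ceph get deployments -o json`; the namespace is an assumption. This minimal version prints requested, updated, and available replicas plus the `rook-version` label. It prints `<none>` where the label is missing, as it is on the operator and toolbox deployments.

```python
"""Summarize replicas and the rook-version label for each deployment.

Minimal sketch. Input is `kubectl -n rook-ceph get deployments -o json`
(the namespace is an assumption).
"""
import json
import sys


def deployment_summary(deploy_list):
    """Return one tab-separated 'name, req/upd/avl, rook-version' line per deployment."""
    lines = []
    for d in deploy_list.get("items", []):
        labels = d["metadata"].get("labels", {})
        spec, status = d.get("spec", {}), d.get("status", {})
        # spec.replicas defaults to 1; zero-valued status counters are omitted
        # from the API output, so they default to 0 here.
        lines.append(
            f"{d['metadata']['name']}\treq/upd/avl: "
            f"{spec.get('replicas', 1)}/{status.get('updatedReplicas', 0)}/"
            f"{status.get('availableReplicas', 0)}\t"
            f"rook-version={labels.get('rook-version', '<none>')}"
        )
    return lines


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        print("\n".join(deployment_summary(json.load(f))))
```

During an upgrade, a deployment whose updated or available count lags behind the requested count has not finished rolling out yet.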
Rook Volume Health
Any pod that is using a Rook volume should also remain healthy:
- The pod should be in the Running state with few, if any, restarts
- There should be no errors in its logs
- The pod should still be able to read and write to the attached Rook volume.
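The read/write check can be sketched as a small probe that writes a file to the mounted volume, reads it back, and removes it. This assumes Python is available inside the application pod, and the mount path passed in is hypothetical. Because the read may be served from the page cache, this confirms the volume accepts writes and returns data. It does not prove the data reached the storage backend.

```python
"""Write, read back, and remove a small probe file on a mounted volume.

Minimal sketch, assumed to run inside an application pod that has Python;
the mount path given on the command line is hypothetical.
"""
import os
import sys
import uuid


def volume_read_write_ok(mount_path):
    """Return True if a probe file can be written and read back intact.

    The probe file is removed afterwards whenever possible.
    """
    probe = os.path.join(mount_path, f".rook-probe-{uuid.uuid4().hex}")
    payload = b"rook volume health probe"
    try:
        with open(probe, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the volume
        with open(probe, "rb") as f:
            return f.read() == payload
    except OSError:
        return False
    finally:
        try:
            os.remove(probe)
        except OSError:
            pass  # probe was never created, or the volume is no longer writable


if __name__ == "__main__" and len(sys.argv) > 1:
    print("ok" if volume_read_write_ok(sys.argv[1]) else "FAILED")
```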