Health Verification

Rook and Ceph upgrades are designed to ensure data remains available even while the upgrade is proceeding. Rook will perform the upgrades in a rolling fashion such that application pods are not disrupted. To ensure the upgrades are seamless, it is important to begin the upgrades with Ceph in a fully healthy state. This guide reviews ways of verifying the health of a CephCluster.

See the troubleshooting documentation for help with any issues that arise during an upgrade.

Pods all Running

In a healthy Rook cluster, all pods in the Rook namespace should be in the Running (or Completed) state and have few, if any, pod restarts.

kubectl -n $ROOK_CLUSTER_NAMESPACE get pods
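As a rough sketch, the pod listing can be filtered down to the pods that need attention. The helper name `unhealthy_pods` and the default restart threshold of 5 are assumptions made for illustration and are not part of Rook. The function expects the default `kubectl get pods --no-headers` columns (NAME READY STATUS RESTARTS AGE) on standard input.

```shell
# Hypothetical helper: print pods that are not Running/Completed, or whose
# restart count exceeds a threshold (default 5, an arbitrary choice).
# Input: lines of `kubectl get pods --no-headers` on stdin.
unhealthy_pods() {
  max_restarts="${1:-5}"
  awk -v max="$max_restarts" '
    NF >= 4 {
      status = $3
      restarts = $4 + 0
      if ((status != "Running" && status != "Completed") || restarts > max)
        print $1, status, restarts
    }'
}

# Usage against a live cluster (assumes ROOK_CLUSTER_NAMESPACE is set):
#   kubectl -n "$ROOK_CLUSTER_NAMESPACE" get pods --no-headers | unhealthy_pods
```

An empty result suggests every pod is Running or Completed with few restarts; any printed line names a pod worth inspecting before the upgrade.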

Status Output

The Rook toolbox contains the Ceph tools that report status details of the cluster, such as the ceph status command. Run it from the toolbox pod as follows:

TOOLS_POD=$(kubectl -n $ROOK_CLUSTER_NAMESPACE get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[*].metadata.name}')
kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph status

The output should look similar to the following:

    id:     a3f4d647-9538-4aff-9fd1-b845873c3fe9
    health: HEALTH_OK

    mon: 3 daemons, quorum b,c,a
    mgr: a(active)
    mds: myfs-1/1/1 up  {0=myfs-a=up:active}, 1 up:standby-replay
    osd: 6 osds: 6 up, 6 in
    rgw: 1 daemon active

    pools:   9 pools, 900 pgs
    objects: 67  objects, 11 KiB
    usage:   6.1 GiB used, 54 GiB / 60 GiB avail
    pgs:     900 active+clean

    client:   7.4 KiB/s rd, 681 B/s wr, 11 op/s rd, 4 op/s wr
    recovery: 164 B/s, 1 objects/s

In the output above, note the following indications that the cluster is in a healthy state:

  • Cluster health: The overall cluster status is HEALTH_OK and there are no warning or error status messages displayed.
  • Monitors (mon): All of the monitors are included in the quorum list.
  • Manager (mgr): The Ceph manager is in the active state.
  • OSDs (osd): All OSDs are up and in.
  • Placement groups (pgs): All PGs are in the active+clean state.
  • (If applicable) Ceph filesystem metadata server (mds): All MDSes are active for all filesystems.
  • (If applicable) Ceph object store RADOS gateways (rgw): All daemons are active.

If the ceph status output has deviations from the general good health described above, there may be an issue that needs to be investigated further. Other commands may show more relevant details on the health of the system, such as ceph osd status. See the Ceph troubleshooting docs for help.
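The health indications listed above can be checked mechanically. The following is a minimal sketch, assuming the plain-text `ceph status` layout shown in the sample output; the helper name `check_ceph_status` is hypothetical. It covers only overall health, OSD up/in counts, and PG states, so it does not replace reading the full output.

```shell
# Hypothetical helper: scan plain-text `ceph status` output on stdin and
# report deviations from the healthy indications described in this guide.
# Returns 0 when no problem was found, 1 otherwise.
check_ceph_status() {
  awk '
    $1 == "health:" && $2 != "HEALTH_OK" { print "health is " $2; bad = 1 }
    # e.g. "osd: 6 osds: 6 up, 6 in"
    $1 == "osd:" {
      total = $2 + 0; n_up = -1; n_in = -1
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^up,?$/) n_up = $(i - 1) + 0
        if ($i ~ /^in,?$/) n_in = $(i - 1) + 0
      }
      if (n_up != total || n_in != total) {
        print "osds: " n_up " up, " n_in " in, of " total; bad = 1
      }
    }
    # e.g. "pools:   9 pools, 900 pgs"
    $1 == "pools:" { pgs_total = $4 + 0 }
    # e.g. "pgs:     900 active+clean"
    $1 == "pgs:" {
      if ($3 != "active+clean" || $2 + 0 != pgs_total) {
        print "not all PGs are active+clean"; bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage (assumes TOOLS_POD and ROOK_CLUSTER_NAMESPACE are set as above):
#   kubectl -n "$ROOK_CLUSTER_NAMESPACE" exec "$TOOLS_POD" -- ceph status | check_ceph_status
```

A non-zero return value only signals that something differs from the healthy pattern; the printed lines indicate where to look first.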

Upgrading an unhealthy cluster

Rook will not upgrade Ceph daemons while the cluster health is HEALTH_ERR. Rook can be configured to proceed with the (potentially unsafe) upgrade anyway by setting either skipUpgradeChecks: true or continueUpgradeAfterChecksEvenIfNotHealthy: true, as described in the cluster CR settings.
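One way to apply either override without editing the manifest by hand is a JSON merge patch. The sketch below assumes both settings sit directly under spec in the CephCluster resource; the helper name `upgrade_override_patch` and the `CLUSTER_NAME` variable are hypothetical. The helper only builds the patch body and rejects any other setting name.

```shell
# Hypothetical helper: print a JSON merge patch that enables one of the two
# override settings named in this section (assumed to live under spec).
upgrade_override_patch() {
  case "$1" in
    skipUpgradeChecks|continueUpgradeAfterChecksEvenIfNotHealthy)
      printf '{"spec":{"%s":true}}\n' "$1" ;;
    *)
      echo "unknown setting: $1" >&2; return 1 ;;
  esac
}

# Usage (assumes CLUSTER_NAME holds the name of the CephCluster resource):
#   kubectl -n "$ROOK_CLUSTER_NAMESPACE" patch cephcluster "$CLUSTER_NAME" \
#     --type merge -p "$(upgrade_override_patch skipUpgradeChecks)"
```

Because the upgrade is potentially unsafe in this state, it is probably wise to set the value back to false once the upgrade has finished.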

Container Versions

The container version running in a specific pod in the Rook cluster can be verified in its pod spec output. For example, for the monitor pod mon-b, verify the container version it is running with the below commands:

POD_NAME=$(kubectl -n $ROOK_CLUSTER_NAMESPACE get pod -o custom-columns=:metadata.name --no-headers | grep rook-ceph-mon-b)
kubectl -n $ROOK_CLUSTER_NAMESPACE get pod ${POD_NAME} -o jsonpath='{.spec.containers[0].image}'

The status and container versions for all Rook pods can be collected all at once with the following commands:

kubectl -n $ROOK_OPERATOR_NAMESPACE get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' && \
kubectl -n $ROOK_CLUSTER_NAMESPACE get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\n\t"}{.status.phase}{"\t\t"}{.spec.containers[0].image}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}'

The rook-version label exists on Ceph resources. The commands below summarize the resource controllers: they report the requested, updated, and currently available replicas for various Rook resources, along with the Rook version of each resource managed by Rook. Note that the operator and toolbox deployments do not have a rook-version label set.

kubectl -n $ROOK_CLUSTER_NAMESPACE get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'

kubectl -n $ROOK_CLUSTER_NAMESPACE get jobs -o jsonpath='{range .items[*]}{.metadata.name}{"  \tsucceeded: "}{.status.succeeded}{"      \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'
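The deployment summary can also be screened for resources that are not yet fully updated. This is a sketch under assumptions: the helper name `stale_deployments` is hypothetical, the expected version is passed in by the caller, and each input line is assumed to have the form `<name> req/upd/avl: R/U/A rook-version=<version>`. Resources without a rook-version label, such as the operator and toolbox, are skipped in the version comparison.

```shell
# Hypothetical helper: read deployment summary lines on stdin and print the
# deployments that are not fully updated.
# $1 is the Rook version expected after the upgrade (supplied by the caller).
stale_deployments() {
  awk -v want="$1" '
    {
      split($3, r, "/")                   # requested / updated / available
      ver = $4; sub(/^rook-version=/, "", ver)
      if (r[1] != r[2] || r[1] != r[3])
        print $1 ": replicas " $3
      else if (ver != "" && ver != want)  # unlabeled resources are skipped
        print $1 ": rook-version " ver
    }'
}

# Usage: pipe the deployments summary command into the helper, e.g.
#   <deployments jsonpath command> | stale_deployments "$EXPECTED_ROOK_VERSION"
# where EXPECTED_ROOK_VERSION is a placeholder for the target version label.
```

An empty result suggests that all labeled deployments report the expected version and matching replica counts.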

Rook Volume Health

Any pod that is using a Rook volume should also remain healthy:

  • The pod should be in the Running state with few, if any, restarts.
  • There should be no errors in its logs.
  • The pod should still be able to read and write to the attached Rook volume.
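The read and write check in the last item can be approximated with a small probe file. This is only a sketch: the helper name `rw_check` is hypothetical, and the mount path of the Rook volume is specific to each application, so it is passed as an argument. The function is meant to be run from a shell inside the application pod.

```shell
# Hypothetical helper: verify that a directory accepts a write and returns
# the same content on read. Pass the Rook volume mount path as $1.
rw_check() {
  dir="$1"
  probe="$dir/.rw-check.$$"      # temporary probe file, removed afterwards
  token="rook-rw-check-$$"
  printf '%s\n' "$token" > "$probe" || { echo "write failed"; return 1; }
  readback=$(cat "$probe") || { echo "read failed"; return 1; }
  rm -f "$probe"
  [ "$readback" = "$token" ] || { echo "content mismatch"; return 1; }
  echo "ok"
}
```

A successful probe shows only that the volume is mounted and writable at that moment; the pod state and logs still need to be checked as described above.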