Search documentation
karat

+

K

ReferenceApollo Product SpecificationObservabilityHealth

Health [Beta]

Health is an advanced workflow that is not generally available. If you wish to use it, contact your Palantir representative to learn more.

This extension of the specification defines service replica health and how service status will be communicated to a product.

Version 2.0

The following specification is Version 2 of the Apollo Product health specification. Note that the JSON objects defined in version 2.0 do not have a field that indicates the version of the object.

Requirements for custom health checks

expected-state-k8s is a Spoke service that collects state and health information from your cluster and forwards it to your Apollo Hub. It watches any Entity installed with Apollo and collects health if it finds the health.probe annotation on a Pod. expected-state-k8s must be installed in your cluster for custom health checks to be forwarded to your Apollo Hub.

For expected-state-k8s to discover the health endpoints, you should add the following annotation to the pod spec template section of your Deployment or StatefulSet in the Helm chart where you want to enable health reporting.

Annotation

The health.probe annotation informs expected-state-k8s where to probe the container to get the health status. It has the following format:

Copied!
1 apollo.palantir.com/health.probe: '{"containerProbes":{"<CONTAINER_NAME>":{"path":"/<CONTEXT_PATH>","port":"<PORT>"}}}
  • CONTAINER_NAME is the name of the container where the health endpoint is served.
  • CONTEXT_PATH is the path where the /status/health endpoint is implemented on your container. For example, if your endpoint is served at /hello-world/status/health, then you should enter /hello-world for the CONTEXT_PATH . If you are serving your endpoint directly at /status/health, you can set this value to be /.
  • PORT is the port that your container is serving the <CONTEXT_PATH>/status/health endpoint on.

Below is an example Helm chart with these labels and annotations applied for an example Product:

  • The container created within the test-deployment deployment is named test-container.
  • The test-container serves the health status endpoint at /telemetry/status/health on port 8446.
Copied!
1 2 3 4 5 6 7 8 kind: Deployment metadata: name: test-deployment spec: template: metadata: annotations: apollo.palantir.com/health.probe: '{"containerProbes":{"test-container":{"path":"/telemetry","port":"8446"}}}'

Once expected-state-k8s has seen the annotation on your Pod, it will query the health endpoint configured by the apollo.palantir.com/health.probe annotation. All queries will be performed with HTTPS, so your endpoint must accept TLS-encrypted traffic.

Network policies

Depending on your cluster's firewall configuration, you may need to apply network policies so that expected-state-k8s can access your Pods.

Below are two example Kubernetes NetworkPolicies that allow expected-state-k8s access to your Pods in your application's namespace. The first allows ingress from the Spoke Control Plane namespace to your application's namespace:

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-ingress-policy namespace: <YOUR_APP_NAMESPACE> spec: podSelector: matchLabels: apollo.palantir.com/status.liveness_and_readiness: "" policyTypes: - Ingress ingress: - from: - namespaceSelector: matchLabels: kubernetes.io/metadata.name: <APOLLO_CONTROL_PLANE_NAMESPACE>

The second allows egress from the Spoke Control Plane namespace to your application's namespace.

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-egress-policy namespace: <APOLLO_CONTROL_PLANE_NAMESPACE> spec: podSelector: matchLabels: apollo.palantir.com/status.liveness_and_readiness: "" policyTypes: - Egress egress: - to: - namespaceSelector: matchLabels: kubernetes.io/metadata.name: <YOUR_APP_NAMESPACE>
  • YOUR_APP_NAMESPACE is the namespace where your Deployment or StatefulSet is running.
  • APOLLO_CONTROL_PLANE_NAMESPACE is the namespace where the Spoke Control Plane is running.
  • apollo.palantir.com/status.liveness_and_readiness: "" is the label that Apollo automatically applies to managed Pods.

For more information about network policies, review the Kubernetes documentation.

Replica Health

Health is a service replica level indicator of a service's operational health. The following states are defined:

  • HEALTHY: The service replica is fully operational with no issues.
  • REPAIRING: The service replica is operating in a degraded state, but is capable of automatically recovering. If any of the replicas in the service were to be restarted, it may result in a state transition failing. For example, when a Cassandra replica is bootstrapping into a cluster for the first time the bootstrap will fail if it or any other replica in the cluster is restarted during this period.
  • WARNING: The service replica is in a state that is trending towards an error. If no corrective action is taken, the health is expected to become an ERROR.
  • ERROR: The service replica is operationally unhealthy.

Calculating Service Replica Health

A replica’s health is determined by the most severe health state of all of its health checks, where the health states are ordered, from least severe to most severe as follows:

  1. HEALTHY
  2. REPAIRING
  3. WARNING
  4. ERROR

Implementation

A service should implement a REST endpoint which accepts a GET request. The endpoint should meet the following requirements:

  • Can be called every 5 seconds without any impact to performance. For resource-intensive health checks that should be called less frequently, the service is expected to cache results of health checks accordingly.
  • Be accessible at a path of /status/health relative to the service's configured apollo-spec-status-endpoint (see Endpoints); for example, https://<host>:<port>/<apollo-spec-status-endpoint>/status/health. The endpoint can be on a different port than the service's primary endpoint to allow for side-cars for open source software, but should be on the primary port for Palantir-developed software.

The endpoint is expected to return an HTTP 200 OK response code for all health states. If the endpoint returns a 5xx error code, this will be interpreted as an internal server error when performing the health check, and is equivalent to the service returning an ERROR.

A service may include many independent health checks but should represent the aggregate health status with a single endpoint.

Status Metadata

A service is expected to provide metadata describing the status of its current state by responding with a valid JSON body to the HTTP GET request.

Each health check should be represented by a Check object with the following fields:

  • type: string, a constant representing the type of health check. Values should be non-empty, uppercase, underscore delimited, ASCII letters with no spaces, ([A-Z_])
  • state: HealthState, one of the health states listed above.
  • message: string, text describing the state of the check which should provide enough information for the check to be actionable when included in an alert. Any information presented in the message should be redacted and safe to move out of the environment.
  • params: map<string, any>, additional information on the nature of the health check. These should also be redacted and safe to move out of the environment.

The JSON body should include the following:

  • checks: map<string, Check>, a map of check types to Check objects describing the state of each health check implemented by the endpoint. The keys correspond to the type field in the Check. The overall state of the replica should correspond to the most severe state of all its health checks.

The full JSON response must be smaller than a maximum length of 10 KB (10,000 bytes). If a response is larger than this limit, the reporting infrastructure may remove check messages and parameters before storing the payload. If a response is modified to due to length, the reporter will add an additional HEALTH_CHECK_MAX_LEN check in WARNING state.

Examples

Healthy
Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 { "checks": { "SOME_BACKGROUND_JOB": { "type": "SOME_BACKGROUND_JOB", "state": "HEALTHY", "message": "Job running healthy", "params": { "lastSuccessTimestamp": 1052240400, "itemsProcessed": 9001 } } } }
Repairing
Copied!
1 2 3 4 5 6 7 8 9 { "checks": { "REPAIR": { "type": "REPAIR", "state": "REPAIRING", "message": "Currently running repair" } } }
Warning
Copied!
1 2 3 4 5 6 7 8 9 { "checks": { "DISK_CAPACITY": { "type": "DISK_CAPACITY", "state": "WARNING", "message": "low disk capacity available. 25% available, but compactions may require up to 30%" } } }
Error
Copied!
1 2 3 4 5 6 7 8 9 { "checks": { "QUORUM_READ_AND_WRITE": { "type": "QUORUM_READ_AND_WRITE", "state": "ERROR", "message": "Failed quorum reads. Unable to ensure data consistency" } } }