Health is an advanced workflow that is not generally available. If you wish to use it, contact your Palantir representative to learn more.
This extension of the specification defines service replica health and how service status will be communicated to a product.
The following specification is Version 2 of the Apollo Product health specification. Note that the JSON objects defined in version 2.0 do not have a field that indicates the version of the object.
expected-state-k8s
is a Spoke service that collects state and health information from your cluster and forwards it to your Apollo Hub. It watches any Entity installed with Apollo and collects health if it finds the health.probe
annotation on a Pod. expected-state-k8s
must be installed in your cluster for custom health checks to be forwarded to your Apollo Hub.
For expected-state-k8s
to discover the health endpoints, you should add the following annotation to the pod spec template
section of your Deployment or StatefulSet in the Helm chart where you want to enable health reporting.
The health.probe
annotation informs expected-state-k8s
where to probe the container to get the health status. It has the following format:
Copied!1
apollo.palantir.com/health.probe: '{"containerProbes":{"<CONTAINER_NAME>":{"path":"/<CONTEXT_PATH>","port":"<PORT>"}}}
CONTAINER_NAME
is the name of the container where the health endpoint is served.CONTEXT_PATH
is the path where the /status/health
endpoint is implemented on your container. For example, if your endpoint is served at /hello-world/status/health
, then you should enter /hello-world
for the CONTEXT_PATH
. If you are serving your endpoint directly at /status/health
, you can set this value to be /
.PORT
is the port that your container is serving the <CONTEXT_PATH>/status/health
endpoint on.Below is an example Helm chart with these labels and annotations applied for an example Product:
test-deployment
deployment is named test-container
.test-container
serves the health status endpoint at /telemetry/status/health
on port 8446.Copied!1 2 3 4 5 6 7 8
kind: Deployment metadata: name: test-deployment spec: template: metadata: annotations: apollo.palantir.com/health.probe: '{"containerProbes":{"test-container":{"path":"/telemetry","port":"8446"}}}'
Once expected-state-k8s
has seen the annotation on your Pod, it will query the health endpoint configured by the apollo.palantir.com/health.probe
annotation. All queries will be performed with HTTPS, so your endpoint must accept TLS-encrypted traffic.
Depending on your cluster's firewall configuration, you may need to apply network policies so that expected-state-k8s
can access your Pods.
Below are two example Kubernetes NetworkPolicies
that allow expected-state-k8s
access to your Pods in your application's namespace. The first allows ingress from the Spoke Control Plane namespace to your application's namespace:
Copied!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-ingress-policy namespace: <YOUR_APP_NAMESPACE> spec: podSelector: matchLabels: apollo.palantir.com/status.liveness_and_readiness: "" policyTypes: - Ingress ingress: - from: - namespaceSelector: matchLabels: kubernetes.io/metadata.name: <APOLLO_CONTROL_PLANE_NAMESPACE>
The second allows egress from the Spoke Control Plane namespace to your application's namespace.
Copied!1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-egress-policy namespace: <APOLLO_CONTROL_PLANE_NAMESPACE> spec: podSelector: matchLabels: apollo.palantir.com/status.liveness_and_readiness: "" policyTypes: - Egress egress: - to: - namespaceSelector: matchLabels: kubernetes.io/metadata.name: <YOUR_APP_NAMESPACE>
YOUR_APP_NAMESPACE
is the namespace where your Deployment or StatefulSet is running.APOLLO_CONTROL_PLANE_NAMESPACE
is the namespace where the Spoke Control Plane is running.apollo.palantir.com/status.liveness_and_readiness: ""
is the label that Apollo automatically applies to managed Pods.For more information about network policies, review the Kubernetes documentation.
Health is a service replica level indicator of a service's operational health. The following states are defined:
HEALTHY
: The service replica is fully operational with no issues.REPAIRING
: The service replica is operating in a degraded state, but is capable of automatically recovering. If any of the replicas in the service were to be restarted, it may result in a state transition failing. For example, when a Cassandra replica is bootstrapping into a cluster for the first time the bootstrap will fail if it or any other replica in the cluster is restarted during this period.WARNING
: The service replica is in a state that is trending towards an error. If no corrective action is taken, the health is expected to become an ERROR
.ERROR
: The service replica is operationally unhealthy.A replica’s health is determined by the most severe health state of all of its health checks, where the health states are ordered, from least severe to most severe as follows:
HEALTHY
REPAIRING
WARNING
ERROR
A service should implement a REST endpoint which accepts a GET
request. The endpoint should meet the following requirements:
/status/health
relative to the service's configured apollo-spec-status-endpoint (see Endpoints); for example, https://<host>:<port>/<apollo-spec-status-endpoint>/status/health
. The endpoint can be on a different port than the service's primary endpoint to allow for side-cars for open source software, but should be on the primary port for Palantir-developed software.The endpoint is expected to return an HTTP 200 OK response code for all health states. If the endpoint returns a 5xx error code, this will be interpreted as an internal server error when performing the health check, and is equivalent to the service returning an ERROR
.
A service may include many independent health checks but should represent the aggregate health status with a single endpoint.
A service is expected to provide metadata describing the status of its current state by responding with a valid JSON body to the HTTP GET request.
Each health check should be represented by a Check
object with the following fields:
type
: string, a constant representing the type of health check. Values should be non-empty, uppercase, underscore delimited, ASCII letters with no spaces, ([A-Z_])state
: HealthState, one of the health states listed above.message
: string, text describing the state of the check which should provide enough information
for the check to be actionable when included in an alert. Any information presented in the message should be redacted and safe to move out of the environment.params
: map<string, any>, additional information on the nature of the health check. These should also be redacted and safe to move out of the environment.The JSON body should include the following:
checks
: map<string, Check>, a map of check types to Check
objects describing the state of each health check implemented by the endpoint. The keys correspond to the type
field in the Check. The overall state of the replica should correspond to the most severe state of all its health checks.The full JSON response must be smaller than a maximum length of 10 KB (10,000 bytes). If a response is larger than this limit, the reporting infrastructure may remove check messages and parameters before storing the payload. If a response is modified to due to length, the reporter will add an additional HEALTH_CHECK_MAX_LEN
check in WARNING
state.
Copied!1 2 3 4 5 6 7 8 9 10 11 12 13
{ "checks": { "SOME_BACKGROUND_JOB": { "type": "SOME_BACKGROUND_JOB", "state": "HEALTHY", "message": "Job running healthy", "params": { "lastSuccessTimestamp": 1052240400, "itemsProcessed": 9001 } } } }
Copied!1 2 3 4 5 6 7 8 9
{ "checks": { "REPAIR": { "type": "REPAIR", "state": "REPAIRING", "message": "Currently running repair" } } }
Copied!1 2 3 4 5 6 7 8 9
{ "checks": { "DISK_CAPACITY": { "type": "DISK_CAPACITY", "state": "WARNING", "message": "low disk capacity available. 25% available, but compactions may require up to 30%" } } }
Copied!1 2 3 4 5 6 7 8 9
{ "checks": { "QUORUM_READ_AND_WRITE": { "type": "QUORUM_READ_AND_WRITE", "state": "ERROR", "message": "Failed quorum reads. Unable to ensure data consistency" } } }