Search documentation
karat

+

K

Troubleshooting in Kubernetes clusters

Apollo leverages core Kubernetes and Helm APIs to make changes to the deployed resources in your environment. Apollo gathers much of the information centrally in your Hub to help you operate at scale across clusters. However, Apollo is not intended to be a full replacement for in-cluster tools and expertise. You will eventually need to debug why something is going wrong.

The goal of this guide is to highlight the key commands that are useful in your cluster when troubleshooting problems with your Apollo-managed Helm charts. This is not an exhaustive list, it is intended to be a jumping off point that connects you to valuable resources outside of Apollo.

Kubernetes

You should be familiar with the Kubernetes basics before you start using Apollo. We recommend reviewing the Concepts pages, and trying out some of the Kubernetes tutorials first. You can also refer to the kubectl Cheat Sheet for more information on the commands.

The goal of this guide is to help point you in the right direction when troubleshooting problems when managing Helm charts with Apollo.

The core Kubernetes documentation recommends using minikube when learning about Kubernetes. This works for basic cases, but has limitations that make it more challenging to use Apollo. We recommend that you use kind configured as a basic multi-node cluster.

Get pods

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 kubectl get pods -—all-namespaces NAMESPACE NAME READY STATUS RESTARTS AGE default wordpress-57577c7cf9-99b8t 1/1 Running 0 3h33m default wordpress-mariadb-0 1/1 Running 0 3h33m infrastructure-test expected-state-k8s-69ccbc7b47-qw9s7 1/1 Running 0 7h20m infrastructure-test expected-state-k8s-69ccbc7b47-rsjwd 1/1 Running 0 7h20m infrastructure-test helm-chart-operator-6986d7cff7-xtb94 1/1 Running 0 7h20m kube-system coredns-6d4b75cb6d-kx5c9 1/1 Running 0 7h29m ... kube-system kube-scheduler-kind-control-plane3 1/1 Running 1 7h26m local-path-storage local-path-provisioner-9cd9bd544-qtth4 1/1 Running 0 7h29m

Describe pod

We recommend using the kubectl describe pod view information about a pod. The Events section at the end of the output can help you debug startup failures if your pod is in a state of Pending, CreateContainerConfigError, or ImagePullBackOff.

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 kubectl describe pod wordpress-584d8d7dbf-gflrl Name: wordpress-584d8d7dbf-gflrl Namespace: default Priority: 0 Service Account: default Node: kind-worker3/172.18.0.3 Start Time: Thu, 23 Nov 2023 09:03:31 -0800 Labels: apollo.palantir.com/status.liveness_and_readiness= app.kubernetes.io/instance=wordpress app.kubernetes.io/managed-by=Helm app.kubernetes.io/name=wordpress app.kubernetes.io/version=6.3.2 helm.sh/chart=wordpress-18.1.0 pod-template-hash=584d8d7dbf Annotations: apollo.palantir.com/metadata.entity.id: aeid:troubleshoot-3x5x:helm-chart:wordpress apollo.palantir.com/metadata.environment.id: troubleshoot-3x5x apollo.palantir.com/metadata.product.group: com.example apollo.palantir.com/metadata.product.name: wordpress apollo.palantir.com/metadata.product.version: 18.1.0 container.seccomp.security.alpha.kubernetes.io/wordpress: runtime/default ... // There's a lot more! Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 8m43s default-scheduler Successfully assigned default/wordpress-584d8d7dbf-gflrl to kind-worker3 Normal Pulling 8m43s kubelet Pulling image "docker.io/bitnami/wordpress:6.3.2-debian-11-r5" Normal Pulled 8m3s kubelet Successfully pulled image "docker.io/bitnami/wordpress:6.3.2-debian-11-r5" in 40.317760997s Normal Created 8m2s kubelet Created container wordpress Normal Started 8m2s kubelet Started container wordpress

Pod logs

Apollo assumes that you have your own way to access logs for your running pods. These logs are often critical to troubleshooting problems with your deployed Helm charts. These logs will not be surfaced in Apollo. Apollo will only show information emitted from Helm or Kubernetes. If your pods started and is in a state of Error or CrashLoopBackOff, you should review the pod logs to diagnose the issue.

If you have at least read access to kubectl in-cluster, you can run kubectl logs using the appropriate pod by NAME from the kubectl get pods command above.

Note that you can use additional flags like --since to limit outputs to more recent and relevant log outputs.

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 kubectl logs wordpress-584d8d7dbf-gflrl --since 1m wordpress 17:04:12.20 wordpress 17:04:12.20 Welcome to the Bitnami wordpress container wordpress 17:04:12.20 Subscribe to project updates by watching https://github.com/bitnami/containers wordpress 17:04:12.21 Submit issues and feature requests at https://github.com/bitnami/containers/issues wordpress 17:04:12.21 wordpress 17:04:12.21 INFO ==> Starting WordPress setup wordpress 17:04:12.25 INFO ==> Generating sample certificates Generating RSA private key, 4096 bit long modulus (2 primes) ......................................................................................................++++ .............................................................................................................................................................................................................++++ e is 65537 (0x010001) Signature ok subject=CN = example.com Getting Private key realpath: /bitnami/apache/conf: No such file or directory wordpress 17:04:14.01 INFO ==> Configuring the HTTP port wordpress 17:04:14.03 INFO ==> Configuring the HTTPS port wordpress 17:04:14.04 INFO ==> Configuring Apache ServerTokens directive wordpress 17:04:14.07 INFO ==> Configuring PHP options wordpress 17:04:14.07 INFO ==> Setting PHP expose_php option wordpress 17:04:14.08 INFO ==> Setting PHP output_buffering option wordpress 17:04:14.11 INFO ==> Validating settings in MYSQL_CLIENT_* env vars wordpress 17:04:14.17 WARN ==> You set the environment variable ALLOW_EMPTY_PASSWORD=yes. For safety reasons, do not use this flag in a production environment. wordpress 17:04:14.36 INFO ==> Ensuring WordPress directories exist wordpress 17:04:14.36 INFO ==> Trying to connect to the database server

Debug running pods

The Kubernetes documentation contains a guide on how to jump into and debug pods of various shapes. There is a lot of variety depending on how you have set up your containers, so we recommend reviewing this guide to identify the right tools for your environments.

Get services

Services are how you allow traffic to your containers, in cluster or externally. If you want to access a website you deployed with a Helm chart, you will need to know the appropriate IP addresses to use to do so. The kubectl get svc command will help find that IP address, or tell you if it is missing.

Copied!
1 2 3 4 5 6 7 kubectl get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 104m metallb-webhook-service ClusterIP 10.96.49.198 <none> 443/TCP 23m wordpress LoadBalancer 10.96.31.110 <pending> 80:32314/TCP,443:32281/TCP 20m wordpress-mariadb ClusterIP 10.96.200.146 <none> 3306/TCP 20m

Common issues and troubleshooting

Some Helm charts assume things about your Kubernetes environments for the Helm chart to work properly. Ultimately, this is an issue with the Helm Chart, the environment it is deployed into, or the way that it has been configured. You must choose how to resolve these issues in each of your environments and Helm charts.

This section documents some common problems you may run into, with a recommendation for what to do about it. It is important to remember that these are just starting point options and you must determine the right resolution.

Missing LoadBalancer

A common problem that you may run into is the assumption that a LoadBalancer exists in the environment already. This is provided for you if you are running your Helm chart in a hyperscaler cloud Kubernetes environment like Google Kubernetes Engine (GKE), Azure Kubernetes Service (AKS), or Elastic Kubernetes Service (EKS).

This is a cloud provider offering that will allocate IP addresses to your services. If you are using kind as we recommend for local development, you will not have this, and Helm charts like Wordpress will fail to install without adding supporting resources to the environment.

Service does not have load balancer ingress IP address: default/wordpress

You can choose how to resolve these issues in each of your environments and Helm charts. For a basic local kind cluster, we recommend deploying a LoadBalancer like MetalLB.

Helm charts

Before deploying Helm charts to Apollo, you should be familiar with how Helm charts work generally, and how a specific Helm chart can be deployed to a Kubernetes cluster independent of Apollo. We recommend starting with the official Helm Quickstart Guide. This guide make the connection between what happens in Apollo and what happens in Helm.

Apollo uses Helm to manage Helm chart type Entities deployed in your environment. This means that as long as you have access to your environment, the same Helm commands you would normally use are still valuable.

Helm list

Copied!
1 2 3 4 helm list —all NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION wordpress default 6 2023-11-21 23:00:09 deployed wordpress-18.1.0 6.3.2

Helm Release History

Copied!
1 2 3 4 5 6 7 8 9 helm history wordpress REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION 1 Tue Nov 21 20:30:27 2023 failed wordpress-18.1.0 6.3.2 Release "wordpress" failed: timed out waiting for the condition 2 Tue Nov 21 20:37:33 2023 failed wordpress-18.1.0 6.3.2 Upgrade "wordpress" failed: timed out waiting for the condition 3 Tue Nov 21 22:45:53 2023 failed wordpress-18.1.0 6.3.2 Upgrade "wordpress" failed: timed out waiting for the condition 4 Tue Nov 21 22:52:58 2023 failed wordpress-18.1.0 6.3.2 Upgrade "wordpress" failed: timed out waiting for the condition 5 Tue Nov 21 23:00:04 2023 pending-upgrade wordpress-18.1.0 6.3.2 Preparing upgrade 6 Tue Nov 21 23:00:09 2023 deployed wordpress-18.1.0 6.3.2 Rollback to 4

Helm get values

Copied!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 helm get values wordpress USER-SUPPLIED VALUES: apollo: artifactStoreURIs: {} entity: id: aeid:test-environment-lrj3:helm-chart:wordpress environment: id: test-environment-lrj3 product: group: com.example name: wordpress version: 18.1.0 service.type: NodePort

Environments

If you set up your Apollo Environment with a local Kubernetes cluster, you should confirm that your agents are successfully reporting to Apollo. Even though the Pods for the agents are running, they might not be reporting back to Apollo. This will prevent Apollo from issuing Plans for the Environment.

Reapply manifest

One possible solution to this problem is to reapply the manifest onto your Environment. Navigate to your Environment in Apollo, and find the Actions dropdown in the top right. Select Actions to open the dropdown, and then select Environment setup instructions. Follow the instructions in the dialog. Once you have downloaded and applied the manifest, wait for a minute and reload the page.

Create a new Environment

If you quit or restarted Docker, or restarted your computer, the agents in your Environment might be unable to reconnect to Apollo. If you have tried the above troubleshooting steps and your agents are still disconnected, you may need to create a new Environment.