Integrating K8sGPT Operator with Prometheus and Grafana for Enhanced Observability

Prashant Lakhera
10 min read · Dec 16, 2024


The K8sGPT Operator is an intermediary between the Kubernetes control plane, K8sGPT diagnostics workloads, and external observability tools like Prometheus. Here’s a breakdown of its key components and their interactions:

  1. Custom Resource Definition (CRD): The K8sGPT CRD allows users to declaratively define the scope and behavior of K8sGPT diagnostics through YAML manifests.
  2. K8sGPT Operator: The operator constantly monitors for changes in the K8sGPT CRD. It reconciles the desired state defined in the CRD with the cluster’s actual state, ensuring the K8sGPT deployment matches user specifications. It exposes metrics via Prometheus-compatible endpoints and gathers diagnostic metrics and analysis results from the K8sGPT deployment.
  3. K8sGPT Deployment: The K8sGPT deployment is the main diagnostic engine. It interacts directly with the Kubernetes API Server to gather cluster state and resource information and performs AI-driven analysis on the specified resources (e.g., pods, services, deployments).
  4. API Server Integration: The API Server provides a real-time cluster state for the K8sGPT deployment. The K8sGPT deployment uses this data to perform checks, identify anomalies, and analyze cluster health. Feedback is sent back to the operator for further processing.
  5. Prometheus Integration: Prometheus scrapes metrics exposed by the K8sGPT Operator. These metrics can be visualized using dashboards like Grafana or used for alerting purposes.
  6. Results Handling: Once diagnostics are performed, the results are:
  • Processed and stored by the operator.
  • Outputted to user-defined destinations (e.g., logs, dashboards).
  • Integrated into workflows like CI/CD pipelines for automated actions.

Prerequisites for Setting Up the k8sGPT Operator

Before diving into the setup, let’s discuss the prerequisites:

1. A Kubernetes Cluster

The first and most obvious requirement is access to a Kubernetes cluster. This operator runs within the cluster, so it’s a fundamental need.

NOTE: If you’re setting up a development environment with a Kubernetes cluster, you can use a lightweight tool like kind (Kubernetes IN Docker). It’s simple to use and ideal for local development. Check out the quick-start guide to get started: https://kind.sigs.k8s.io/docs/user/quick-start/

2. Helm

Helm is a package manager for Kubernetes that simplifies the deployment of charts, including the k8sGPT operator and Prometheus. We’ll use Helm 3. For installation instructions, see https://helm.sh/docs/intro/install/

3. Prometheus

The k8sGPT operator relies on Prometheus for metrics scraping. Specifically, it uses a ServiceMonitor, a custom resource whose Custom Resource Definition (CRD) is installed alongside the Prometheus operator. This makes Prometheus a critical dependency.

For more information, please check this guide https://docs.k8sgpt.ai/getting-started/in-cluster-operator/
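To make the dependency concrete, here is a minimal, purely illustrative ServiceMonitor sketch. The resource name, label selector, and port name below are assumptions for illustration only; in practice the k8sGPT operator’s Helm chart generates its own ServiceMonitor when its `serviceMonitor.enabled` flag is set, as we do later in this article.

```shell
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-servicemonitor        # hypothetical name
  namespace: k8sgpt-operator-system
spec:
  selector:
    matchLabels:
      app: example-metrics-app        # hypothetical label; must match your Service's labels
  endpoints:
    - port: metrics                   # hypothetical named port on the Service
      interval: 30s
EOF
```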

Let’s start with the Prometheus installation.

  • Add Prometheus Community Repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories

This will add the Prometheus Community Helm chart repository to your local Helm client for access to Prometheus-related charts.

  • Update Helm Repository Cache
helm repo update
Hang tight while we grab the latest from your chart repositories…
…Successfully got an update from the "k8sgpt" chart repository
…Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈

This will update the local Helm chart repository cache to fetch the latest chart versions from all added repositories.

  • Install or Upgrade Kube-Prometheus-Stack
helm upgrade --install prometheus \
  prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --wait
Release "prometheus" does not exist. Installing it now.
NAME: prometheus
LAST DEPLOYED: Fri Dec 6 10:27:44 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace default get pods -l "release=prometheus"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

This installs (or upgrades, if already installed) the kube-prometheus-stack chart from the Prometheus Community repository, applies a specific setting, and waits for the deployment to complete.

The value

--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false 

ensures that Prometheus scrapes metrics from all ServiceMonitors in the cluster, not just the ones labeled for this Helm release. By default, the chart configures Prometheus to select only ServiceMonitors carrying its own release label; setting this flag to false clears that selector, so ServiceMonitors created by other charts, such as the one the k8sGPT operator creates later in this article, are also discovered. This matters in environments where several charts contribute their own ServiceMonitors.
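One way to see what the flag produced is to inspect the selector rendered on the Prometheus object the chart created in the default namespace. An empty selector (`{}`) matches every ServiceMonitor in the cluster, while a label selector restricts scraping to matching ones; depending on the chart version the field may also simply be absent when empty.

```shell
# Show the serviceMonitorSelector rendered on the Prometheus custom resource.
kubectl get prometheus -n default -o yaml | grep -A 2 serviceMonitorSelector
```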

Once the Helm chart for the Prometheus stack is successfully installed, you will notice that all the essential pods and services are up and running within your Kubernetes cluster. These include components like Prometheus, Grafana, Alertmanager, and Node Exporter, each playing a critical role in monitoring and observability.

Pods:

kubectl get pod
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          2m29s
prometheus-grafana-578946f5d5-4rzw7                      3/3     Running   0          2m46s
prometheus-kube-prometheus-operator-747d745469-558bw     1/1     Running   0          2m46s
prometheus-kube-state-metrics-6489887dc-q9z5l            1/1     Running   0          2m46s
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          2m28s
prometheus-prometheus-node-exporter-rcw2q                1/1     Running   0          2m46s

Each pod listed serves a specific purpose:

  • Alertmanager: Handles alert notifications.
  • Grafana: Provides a web-based dashboard for visualizing metrics.
  • Operator: Manages the lifecycle of Prometheus, Alertmanager, and related resources.
  • Kube-State-Metrics: Exposes Kubernetes object metrics.
  • Node-Exporter: Collects hardware and OS metrics.
  • Prometheus: The main Prometheus instance for scraping and storing metrics.

Services:

kubectl get svc
NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   3m52s
kubernetes                                ClusterIP   10.96.0.1       <none>        443/TCP                      13m
prometheus-grafana                        ClusterIP   10.96.55.220    <none>        80/TCP                       4m9s
prometheus-kube-prometheus-alertmanager   ClusterIP   10.96.144.133   <none>        9093/TCP,8080/TCP            4m9s
prometheus-kube-prometheus-operator       ClusterIP   10.96.246.18    <none>        443/TCP                      4m9s
prometheus-kube-prometheus-prometheus     ClusterIP   10.96.7.226     <none>        9090/TCP,8080/TCP            4m9s
prometheus-kube-state-metrics             ClusterIP   10.96.5.182     <none>        8080/TCP                     4m9s
prometheus-operated                       ClusterIP   None            <none>        9090/TCP                     3m51s
prometheus-prometheus-node-exporter       ClusterIP   10.96.19.255    <none>        9100/TCP                     4m9s

These services are crucial for exposing the various components of the Prometheus stack. For example:

  • prometheus-grafana: Exposes the Grafana dashboard on port 80, providing a rich visualization layer for cluster metrics.
  • prometheus-kube-prometheus-prometheus: Exposes the Prometheus UI on port 9090, enabling direct access to metrics and query capabilities.
  • prometheus-kube-prometheus-alertmanager: Exposes the Alertmanager UI on port 9093, where you can view and manage alert notifications.
  • prometheus-kube-prometheus-operator: Exposes the operator on port 443, managing the lifecycle of Prometheus and its related resources.
  • prometheus-kube-state-metrics: Provides Kubernetes object metrics through port 8080, enabling insights into the state of cluster resources.
  • prometheus-prometheus-node-exporter: Exposes the Node Exporter on port 9100, gathering hardware and OS-level metrics from cluster nodes.

These services are accessible within the cluster and can be forwarded to your local machine for further exploration, forming the backbone for monitoring and observability in Kubernetes.

Accessing the Dashboards

Access Prometheus UI

  • To access Prometheus, forward the service port to your local machine:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090  
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

Open a browser and navigate to http://localhost:9090.

  • In the Prometheus UI, open the “Status” menu and select the targets page (“Target health”) to ensure Prometheus is scraping metrics from all configured endpoints.
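The same target-health check can be scripted against Prometheus’s HTTP API while the port-forward above is running; `/api/v1/targets` is part of Prometheus’s stable v1 API. This sketch uses python3 to parse the response.

```shell
# List each scrape pool and its current health (up/down/unknown) via the
# Prometheus HTTP API; requires the port-forward from the previous step.
curl -s http://localhost:9090/api/v1/targets \
  | python3 -c '
import json, sys
data = json.load(sys.stdin)
for t in data["data"]["activeTargets"]:
    print(t["scrapePool"], t["health"])
'
```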

Access Grafana UI

  • Forward the Grafana service port to your local machine:
kubectl port-forward svc/prometheus-grafana 3000:80

Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000

Open a browser and navigate to http://localhost:3000.

Default credentials:

Username: admin
Password: prom-operator
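If you are unsure whether the defaults were changed at install time, the admin password can be read back from the secret the chart creates; the secret name `prometheus-grafana` and key `admin-password` follow the Grafana sub-chart’s naming for a release called `prometheus`.

```shell
# Decode the Grafana admin password stored by the kube-prometheus-stack chart.
kubectl get secret prometheus-grafana -n default \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```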

Access Alertmanager UI

  • Forward the Alertmanager service port:
kubectl port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093

Open a browser and navigate to http://localhost:9093.

Deploying the k8sGPT Operator

Once Prometheus is set up, we can move on to installing the k8sGPT operator. This involves the following steps:

Add the k8sGPT Helm Repository and update it

  • First, add the k8sGPT Helm repository to your local Helm repositories:
helm repo add k8sgpt https://charts.k8sgpt.ai
helm repo update
  • Install the K8sGPT Operator with observability features enabled:
helm install release k8sgpt/k8sgpt-operator \
  -n k8sgpt-operator-system --create-namespace \
  --set interplex.enabled=true \
  --set grafanaDashboard.enabled=true \
  --set serviceMonitor.enabled=true

This Helm command installs the k8sGPT operator in a Kubernetes cluster, with additional configurations to enhance monitoring and alerting capabilities. Notably, the command enables the ServiceMonitor and Grafana dashboard, allowing the Prometheus operator to discover the ServiceMonitors created by the k8sGPT operator. Additionally, a default Grafana dashboard is provisioned by the k8sGPT operator, specifically designed to monitor its own operations.
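For repeatable installs, the same settings can live in a values file instead of --set flags. The keys below simply mirror the flags used above; consult the chart’s values.yaml for the authoritative names.

```shell
# Write the equivalent values file, then install with -f instead of --set.
cat > k8sgpt-values.yaml <<EOF
interplex:
  enabled: true
grafanaDashboard:
  enabled: true
serviceMonitor:
  enabled: true
EOF

helm upgrade --install release k8sgpt/k8sgpt-operator \
  -n k8sgpt-operator-system --create-namespace \
  -f k8sgpt-values.yaml
```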

After installing the k8sGPT operator, we can verify that the operator’s components are up and running in the k8sgpt-operator-system namespace using the following commands:

  1. Check the Pods:
kubectl get pod -n k8sgpt-operator-system
NAME                                                          READY   STATUS    RESTARTS   AGE
release-interplex-0                                           1/1     Running   0          102s
release-k8sgpt-operator-controller-manager-555d4f49fd-t77bj   2/2     Running   0          102s

The output shows the pod release-k8sgpt-operator-controller-manager-555d4f49fd-t77bj with a READY status of 2/2, indicating that both containers in the pod are running without issues. This pod is the primary controller for managing k8sGPT operations, such as analyzing cluster health and reporting issues.

2. Check the Services:

kubectl get svc -n k8sgpt-operator-system
NAME                                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
release-interplex-service                                 ClusterIP   None            <none>        8084/TCP   3m1s
release-k8sgpt-opera-controller-manager-metrics-service   ClusterIP   10.96.158.136   <none>        8443/TCP   3m1s

The output lists the service release-k8sgpt-opera-controller-manager-metrics-service. This ClusterIP service exposes the operator’s metrics endpoint on port 8443. Prometheus can use this service to scrape metrics related to the k8sGPT operator, enabling continuous monitoring of its performance and activities.

NOTE: The release-interplex pod and its corresponding service are additional components deployed as part of the k8sgpt-operator installation. When you install the k8sgpt-operator, it includes certain sub-charts or supporting services that handle data storage, indexing, or backend functionality to support the operator’s AI-driven diagnostics.

Configuring the k8sGPT Operator

The k8sGPT operator requires additional configuration to function correctly. This includes setting up secrets and defining a custom configuration.

1. Create a Kubernetes Secret

The k8sGPT operator uses an OpenAI backend, so we need to create a Kubernetes secret containing the API key.

export OPENAI_TOKEN="<your OpenAI API token>"

Create a Kubernetes secret with the OpenAI API key. This is required to access the OpenAI API. The secret can be created using the following command:

kubectl create secret generic k8sgpt-openai-secret --from-literal=OPENAI_TOKEN=$OPENAI_TOKEN -n k8sgpt-operator-system
secret/k8sgpt-openai-secret created

To deploy a K8sGPT resource, apply a manifest with the following contents:

kubectl apply -f - <<EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: gpt-3.5-turbo
    backend: openai
    secret:
      name: k8sgpt-openai-secret
      key: OPENAI_TOKEN
  noCache: false
  version: v0.3.48
EOF

After applying the configuration for the k8sgpt-sample resource, additional resources are created in the k8sgpt-operator-system namespace: a new pod k8sgpt-sample-85d5bbdd-tf54q, managed by the corresponding deployment and replicaset, and a new service k8sgpt-sample, which exposes the pod on port 8080 for communication and monitoring. Alongside these, the existing operator resources, such as the release-k8sgpt-operator-controller-manager pod and its metrics service, remain operational to manage the overall k8sGPT setup. These additions confirm the successful deployment of the k8sgpt-sample instance, ready for workload monitoring and debugging.

kubectl get all -n k8sgpt-operator-system
NAME                                                              READY   STATUS    RESTARTS   AGE
pod/k8sgpt-sample-85d5bbdd-tf54q                                  1/1     Running   0          25s
pod/release-interplex-0                                           1/1     Running   0          11m
pod/release-k8sgpt-operator-controller-manager-555d4f49fd-t77bj   2/2     Running   0          11m

NAME                                                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/k8sgpt-sample                                             ClusterIP   10.96.215.192   <none>        8080/TCP   25s
service/release-interplex-service                                 ClusterIP   None            <none>        8084/TCP   11m
service/release-k8sgpt-opera-controller-manager-metrics-service   ClusterIP   10.96.158.136   <none>        8443/TCP   11m

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/k8sgpt-sample                                1/1     1            1           25s
deployment.apps/release-k8sgpt-operator-controller-manager   1/1     1            1           11m

NAME                                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/k8sgpt-sample-85d5bbdd                                  1         1         1       25s
replicaset.apps/release-k8sgpt-operator-controller-manager-555d4f49fd   1         1         1       11m

NAME                                 READY   AGE
statefulset.apps/release-interplex   1/1     11m

With these steps completed, the k8sGPT operator should now be fully operational, and the k8sgpt-sample resource successfully created. You can access the results generated by K8sGPT using the following command:

kubectl get results -n k8sgpt-operator-system -o json | jq .
{
  "apiVersion": "v1",
  "items": [
    {
      "apiVersion": "core.k8sgpt.ai/v1alpha1",
      "kind": "Result",
      "metadata": {
        "creationTimestamp": "2024-12-16T23:42:57Z",
        "generation": 1,
        "labels": {
          "k8sgpts.k8sgpt.ai/backend": "openai",
          "k8sgpts.k8sgpt.ai/name": "k8sgpt-sample",
          "k8sgpts.k8sgpt.ai/namespace": "k8sgpt-operator-system"
        },
        "name": "defaultbadpod",
        "namespace": "k8sgpt-operator-system",
        "resourceVersion": "4178",
        "uid": "95fa4d4c-3f6c-42d1-89db-4b99c78c5c02"
      },
      "spec": {
        "backend": "openai",
        "details": "Error: Failed to pull and unpack image \"docker.io/library/nginxnew:latest\": pull access denied, repository does not exist or may require authorization.\n\nSolution: \n1. Check if the image repository \"library/nginxnew\" exists on docker.io.\n2. Ensure you have the necessary permissions to access the repository.\n3. If authorization is required, authenticate with the registry using docker login.\n4. Retry pulling the image after ensuring proper access and permissions.",
        "error": [
          {
            "text": "failed to pull and unpack image \"docker.io/library/nginxnew:latest\": failed to resolve reference \"docker.io/library/nginxnew:latest\": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed"
          }
        ],
        "kind": "Pod",
        "name": "default/badpod",
        "parentObject": ""
      }
    }
  ]
}

Once everything is configured, you can examine essential metrics, such as the findings generated by K8sGPT and detailed operator workload statistics. These give you valuable insights into resource usage and overall efficiency, which you can easily review within a Grafana dashboard: go to the Grafana URL http://localhost:3000 and, under the Dashboards section, click on K8sGPT Overview.
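If you only need specific fields rather than the full JSON, kubectl’s jsonpath output can pull them out directly; the field paths below follow the Result objects shown above.

```shell
# Print one line per finding: the affected resource kind and its namespace/name.
kubectl get results -n k8sgpt-operator-system \
  -o jsonpath='{range .items[*]}{.spec.kind}{"\t"}{.spec.name}{"\n"}{end}'
```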

Conclusion:

Following the steps outlined above, you have successfully integrated the K8sGPT Operator into your Kubernetes environment, complete with Prometheus-based observability and Grafana-powered visualization. Starting from the initial prerequisites (a Kubernetes cluster, Helm, and Prometheus), you have established a fully functional, AI-driven diagnostics environment within your cluster through the operator and custom resource deployment.

With this setup, the K8sGPT Operator continuously monitors cluster resources, surfaces potential issues, and integrates directly with your observability stack. Prometheus collects key metrics, while Grafana provides intuitive dashboards for real-time insights. Leveraging AI-driven analysis, K8sGPT empowers engineers to quickly troubleshoot issues, streamline deployments, and improve the overall health and efficiency of Kubernetes workloads.

As your infrastructure grows and evolves, you can refine the K8sGPT configuration to target specific resources, adjust the operator’s behavior, and tailor its diagnostics to align with your operational needs. The combination of Kubernetes, K8sGPT, Prometheus, and Grafana lays a robust foundation for proactive cluster management, ensuring stable, reliable, and informed operations at scale.
