Set up a scalable Graylog on Azure Kubernetes Service (AKS) with a private IP and the Nginx Ingress Controller

Recently I was tasked with standing up a Graylog cluster for our new development team. While I have stood up a highly scalable Graylog on regular IaaS virtual machines before, Azure AKS was not something I had used in the past. I was always skeptical about how Azure would handle the data volume, so this weekend I was ready to spend some time playing with Azure AKS (and ended up spending the entire weekend).


No time to read the entire post? Here is the GitHub repository that contains all the scripts.


Here is what I wanted to achieve:

  • Graylog on a private network
  • An FQDN assigned with a valid / custom SSL certificate
  • Use our existing subnet on a virtual network
  • Ability to grow the Elasticsearch data volume without data loss
  • A highly available structure for Graylog, Elasticsearch, and MongoDB, so that an underlying infrastructure issue won’t cause downtime for the program


The plan is to set up Graylog, Elasticsearch, and MongoDB as follows:

  • MongoDB replica set: 3 pods
  • Elasticsearch master: 3 pods
  • Elasticsearch client: 3 pods
  • Elasticsearch data: 3 pods with a 500GB expandable volume
  • Graylog: 2 pods with the ability to scale up
  • Azure private load balancer
  • Point a domain to the private load balancer
  • Enable a UDP port for the Graylog input and accept Graylog traffic on the private IP


Due to a bug in the Azure CLI I was not able to create the Azure Kubernetes cluster on an existing subnet using the CLI. Hence I had to create the Kubernetes cluster manually through the Azure Portal, which wasn’t that hard. You can create a cluster with a public or private endpoint. One thing that I don’t like in AKS is that it uses Azure’s CNI, which reserves A LOT of IP addresses so it can assign one IP per pod. So be sure to select a subnet that has plenty of free IP addresses; I ended up assigning an entire /24 block to this Kubernetes cluster.

The following article assumes that the cluster endpoint is public, hence we will set up a private load balancer in our own subnet in order to serve our internal applications.

Assuming that the Kubernetes cluster name is devops-tools and the resource group name is devops-tools-rg:

# Set the correct subscription. Skip this step if this is already set in your cli
az account set --subscription "your-subscription"

# Get the Kubernetes creds and configure .kube/config. This assumes that you have already logged into your Azure environment with az login
az aks get-credentials --resource-group devops-tools-rg --name devops-tools

Nginx Ingress Controller

I was excited to read about the Traefik ingress controller. However, after spending two days and nights on it, I found that its new version had a bug because of which it does not generate the HTTPS route automatically. So I ended up using the Nginx Ingress Controller, which works just great.

Why an ingress controller?

A Kubernetes ingress controller gives us a single entry point for all of our applications. The ingress controller handles the HTTP routing and SSL termination, so we don’t have to spend time configuring an Azure Application Gateway / load balancer.
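To make that concrete, here is a minimal sketch of an Ingress resource that routes a hostname to a backend service through such a controller. The hostname, service name, and secret name are placeholders, and the ingress class must match the one the controller is installed with:

```yaml
# Hypothetical example: route graylog.mydomain.com to a service named
# "graylog" on port 9000, terminating TLS with the mydomain-ssl secret.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: graylog-example
  annotations:
    kubernetes.io/ingress.class: nginx-internal
spec:
  tls:
    - hosts:
        - graylog.mydomain.com
      secretName: mydomain-ssl
  rules:
    - host: graylog.mydomain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: graylog
              servicePort: 9000
```

The Graylog Helm chart used later in this post creates an equivalent Ingress for us, so this is only for illustration.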

Install Nginx Ingress Controller

Installing the Nginx controller using Helm is easy, assuming that you have Helm installed and configured to fetch stable releases.

Create a file called internal-ingress.yaml and copy/paste the contents below. Be sure to change the loadBalancerIP to a free IP address in your subnet.

controller:
  replicaCount: 2
  nodeSelector:
    beta.kubernetes.io/os: linux
  service:
    loadBalancerIP: 10.240.0.42   # change to a free IP address in your subnet
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 80
    targetMemoryUtilizationPercentage: 80
defaultBackend:
  nodeSelector:
    beta.kubernetes.io/os: linux
helm repo add stable https://charts.helm.sh/stable
helm repo update
helm install nginx-internal stable/nginx-ingress \
    --create-namespace \
    --namespace ingress-internal \
    -f internal-ingress.yaml \
    --set controller.ingressClass=nginx-internal

The above Helm command is specific to Helm version 3. If you are on an older version, please refer to this page

This will stand up a private load balancer on Azure with the private IP you specified.

Create a namespace and switch

I like to have a namespace per application as a logical separation: if I no longer want to keep the resources, I can simply delete the namespace instead of worrying about removing individual workloads.

# Pick a name for your namespace, e.g.
namespace=graylog

kubectl create namespace $namespace
kubectl config set-context $(kubectl config current-context) --namespace=$namespace

MongoDB setup

This setup uses a MongoDB replica set with no authentication. This means the containers in the cluster may be able to access MongoDB (but nothing from outside the cluster can). In a properly secured setup you should consider enabling database authentication. This uses the MongoDB Helm chart, which supports adding user auth.

helm install mongodb stable/mongodb-replicaset --namespace $namespace -f mongodb.yaml
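The mongodb.yaml values file is not shown above; a minimal sketch might look like the following. The key names come from the stable/mongodb-replicaset chart, and the volume size is an assumption you should tune for your log volume:

```yaml
replicas: 3
auth:
  enabled: false   # enable and set credentials in a production setup
persistentVolume:
  enabled: true
  size: 20Gi       # assumed size; adjust to your needs
```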

Elastic search setup

This is the interesting part, where I spent hours and hours setting up and testing to make sure that the data volume can be expanded without data loss.

As indicated above, this uses an unauthenticated connection; in a properly secured setup you should enable authentication. Access from outside the cluster is not permitted with the current setup.

File: elasticsearch-client.yaml

clusterName: "elasticsearch"
imageTag: 6.8.9
nodeGroup: "client"
roles:
  master: "false"
  ingest: "false"
  data: "false"
replicas: 2
esJavaOpts: "-Xmx2g -Xms2g"
resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:
    cpu: "2000m"
    memory: "5Gi"
volumeClaimTemplate:
  resources:
    requests:
      storage: 1Gi
persistence:
  enabled: false

File: elasticsearch-data.yaml

clusterName: "elasticsearch"
imageTag: 6.8.9
nodeGroup: "data"
roles:
  master: "false"
  ingest: "true"
  data: "true"
replicas: 3
esJavaOpts: "-Xmx1g -Xms1g"
resources:
  requests:
    cpu: "50m"
    memory: "1Gi"
volumeClaimTemplate:
  resources:
    requests:
      storage: "500Gi"

File: elasticsearch-master.yaml

clusterName: "elasticsearch"
imageTag: 6.8.9
nodeGroup: "master"
roles:
  master: "true"
  ingest: "false"
  data: "false"
replicas: 3
esJavaOpts: "-Xmx3g -Xms3g"
resources:
  requests:
    cpu: "500m"
    memory: "3Gi"
  limits:
    cpu: "2000m"
    memory: "8Gi"
volumeClaimTemplate:
  resources:
    requests:
      storage: 4Gi
# Add the Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Install Elasticsearch
helm upgrade --install --namespace $namespace --values ./elasticsearch-master.yaml elasticsearch-master elastic/elasticsearch
helm upgrade --install --namespace $namespace --values ./elasticsearch-client.yaml elasticsearch-client elastic/elasticsearch
helm upgrade --install --namespace $namespace --values ./elasticsearch-data.yaml elasticsearch-data elastic/elasticsearch

SSL Certificate

This assumes that you already have your own SSL certificate. If you do not have one, you can purchase a new SSL cert or get one free of cost from LetsEncrypt using the DNS challenge. For LetsEncrypt, this post doesn’t cover automatic renewal. For our company, we don’t use LetsEncrypt and have our own paid SSL certificate.

# Import SSL certificate to Kubernetes and create as a secret 
kubectl create secret tls mydomain-ssl --key mydomain.key --cert mydomain.cert --namespace $namespace

Now, finally, create a file called “graylog.yaml” with the following content:

graylog:
  image:
    repository: "graylog/graylog:3.3"
    pullPolicy: "IfNotPresent"
  replicas: 2
  persistence:
    enabled: true
    accessMode: ReadWriteOnce
    size: "10Gi"
  input:
    udp:
      service:
        type: NodePort
      ports:
        - name: gelf-30010-udp
          port: 30010
          nodePort: 30010
  service:
    type: ClusterIP
    port: 9000
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx-internal
    hosts:
      - graylog.mydomain.com   # change to your own domain
    tls:
      - secretName: mydomain-ssl
        hosts:
          - graylog.mydomain.com
  resources:
    limits:
      cpu: "1"
    requests:
      cpu: "100m"
      memory: "1Gi"
  metrics:
    enabled: true

  ## Set Graylog Java heapsize. If this value is empty, the chart will allocate the heapsize using `-XX:+UseCGroupMemoryLimitForHeap`
  ## ref:
  # heapSize: "1024g"

  ## Additional graylog config which is defined on `graylog.conf`.
  ## You can find a complete list of graylog config from
  ## Graylog config is written in Java properties format. Make sure you write it correctly.
  # config: |
  #   elasticsearch_connect_timeout = 10s
  #   elasticsearch_socket_timeout = 60s
  #   elasticsearch_idle_timeout = -1s

  journal:
    deleteBeforeStart: true

  ## Additional server files will be deployed to /etc/graylog/server
  ## For example, you can put server certificates or authorized clients certificates here
  serverFiles:
    server.key: |
    server.cert: |

Deploy Graylog

helm upgrade --install --namespace $namespace graylog stable/graylog \
  --set tags.install-mongodb=false \
  --set tags.install-elasticsearch=false \
  --set graylog.mongodb.uri="mongodb://mongodb-mongodb-replicaset-0.mongodb-mongodb-replicaset.$namespace.svc.cluster.local:27017/graylog?replicaSet=rs0" \
  --set graylog.elasticsearch.hosts="http://elasticsearch-client.$namespace.svc.cluster.local:9200" \
  -f graylog.yaml

A couple of points here:

  • Make sure you change the graylog.yaml file with your own domain name, SSL secret name, etc.
  • Every setup varies in Java memory usage and container CPU/memory limits. I would suggest starting with lower thresholds; if you start seeing OOMKills or a high load due to a large amount of data processing, consider raising the JVM memory limit and the container CPU/RAM.

Wait a few minutes for the containers to come up:

kubectl get pods -w

Setup DNS

While the containers are being created, set up a DNS zone “A” record for your domain pointing to the private load balancer IP that you specified above.

As soon as the containers are up, the Nginx Ingress controller will pick up the change, reconfigure the virtual host automatically, and you will have a working Graylog setup.

Graylog Input

graylog.yaml contains the configuration for adding/editing Graylog inputs. In my setup, I have created only one input (UDP port 30010):

  input:
    udp:
      service:
        type: NodePort
      ports:
        - name: gelf-30010-udp
          port: 30010
          nodePort: 30010

You can add additional Graylog inputs. For TCP, change the input type to “tcp”. For a list of additional configuration options, please check
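Once the input, DNS record, and load balancer rule (next section) are in place, you can verify the pipeline end-to-end by hand-crafting a minimal GELF 1.1 payload and pushing it over UDP. Note that graylog.mydomain.com is a placeholder for your own DNS record, and nc (netcat) must be installed:

```shell
# Build a minimal GELF 1.1 message. "version", "host" and "short_message"
# are the mandatory GELF fields.
payload=$(printf '{"version":"1.1","host":"%s","short_message":"%s"}' \
    "test-host" "hello graylog")
echo "$payload"

# Send it to the UDP input (uncomment once DNS and the LB rule exist):
# echo "$payload" | nc -u -w1 graylog.mydomain.com 30010
```

If everything is wired up, the message shows up under the input’s “Show received messages” in the Graylog UI.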

Point the LB to the Graylog input port

The Graylog inputs are created as a Kubernetes service in your cluster with the ports exposed as a NodePort, so you will have to configure the load balancer to forward the traffic to the correct port. Hence we create a rule for it:

# Make sure to point to the correct resource group and location
az network lb rule create -g mc_devops-tools-rg_devops-tools_eastus \
    --lb-name kubernetes-internal \
    --name udp30010 \
    --protocol Udp \
    --frontend-port 30010 \
    --backend-port 30010 \
    --backend-pool-name Kubernetes


And that’s it. Now you should have a working Graylog instance.


  • How do I upgrade Graylog version?
    • Edit graylog.yaml and update the “repository” version. E.g., repository: “graylog/graylog:3.3” => repository: “graylog/graylog:3.4”
    • Re-run the “helm upgrade --install --namespace $namespace graylog stable/graylog” command provided above.
    • This will kill the existing pods and start new pods with the new release.
  • How do I upgrade Elasticsearch or MongoDB?
    • The process is the same. Edit the yaml file and re-run. Note that Elasticsearch 7 is not yet supported by Graylog at the time of this writing, so if you are thinking about upgrading Elasticsearch to 7, don’t do it now.
  • How do I scale up the Elasticsearch data volume size?
    • This requires a bit more explanation; read on below.

Scaling up the Volume for Elasticsearch

In this case, we created 3 StatefulSets. StatefulSets come with some limitations; one of them is that you cannot resize the underlying PV/PVC while the pods are running. So if you want to grow the volume of the Elasticsearch data pods, you will have to shut down the StatefulSet (or scale the replicas to 0) and then scale up the PVCs.
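One prerequisite worth checking: PVC expansion only works if the storage class allows it. A sketch of such a StorageClass for Azure managed disks follows; the class name is hypothetical, and on recent AKS versions the built-in managed-premium class already sets this flag:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-expandable   # hypothetical name
provisioner: kubernetes.io/azure-disk
allowVolumeExpansion: true           # required for resizing a PVC via kubectl edit
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
```

You can check the flag on your existing class with `kubectl get sc managed-premium -o yaml`.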

Warning: shutting down the Elasticsearch data nodes may cause downtime for Graylog log writing (even though Graylog itself will stay up). However, as I understand it, you will not lose data, as Graylog saves the incoming logs in its journal for some time.
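If you expect the Elasticsearch downtime to be long, you can enlarge that buffer through the chart’s config block in graylog.yaml. These are standard graylog.conf journal settings; the sizes below are examples, not recommendations:

```yaml
graylog:
  config: |
    # Keep up to 12 hours / 10 GB of not-yet-indexed messages on local disk
    message_journal_max_age = 12h
    message_journal_max_size = 10gb
```

Make sure the pod’s persistent volume (10Gi in our setup) is large enough to hold the journal size you configure.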

# Step 1 - Shut down the elasticsearch data StatefulSet
kubectl scale sts elasticsearch-data --replicas=0

# Wait for the StatefulSet to show 0 pods
kubectl get sts

# Now, scale up the Persistent Volume Claims
kubectl get pvc

kubectl edit pvc elasticsearch-data-elasticsearch-data-0

# Look for spec.resources.requests.storage and update it with the new storage size. Note that you cannot scale down, only up. Update and save

# Repeat the same for the other data pvcs
kubectl edit pvc elasticsearch-data-elasticsearch-data-1
kubectl edit pvc elasticsearch-data-elasticsearch-data-2

# Use the following command to view the log of the resizing
kubectl describe pvc elasticsearch-data-elasticsearch-data-0

# It should say it is waiting for the pod to start up before resizing.
# Open the Azure Portal and go to the k8s resource group. Verify that the persistent volumes are resized.

# Now scale the StatefulSet back up

kubectl scale sts elasticsearch-data --replicas=3


All the scripts that I’ve used can be found in the GitHub repository mentioned at the top of this post.
