Troubleshooting CoreDNS in OpenShift 4.x

From using SRV records for locating available services to providing stable network identities for the Pods of a StatefulSet through a headless Service, DNS is a fundamental component of service discovery within Kubernetes. CoreDNS has become the standard DNS server used with Kubernetes and is the DNS provider within OpenShift / OKD 4.x deployments, succeeding the previously-used SkyDNS service.

In OpenShift, CoreDNS is managed by an Operator located in the openshift-dns-operator Namespace. The Cluster DNS Operator (CDO) watches for changes to the dnses.operator.openshift.io custom resource (CR) and publishes a dns-default DaemonSet into the openshift-dns Namespace with the intent of running CoreDNS on all Nodes.
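The DaemonSet and its Pods can be inspected directly to confirm that CoreDNS is running on every Node; for example:

```shell
# List the CoreDNS DaemonSet published by the Operator
$ kubectl get daemonset -n openshift-dns

# List the CoreDNS Pods, one per Node, with their Node placement
$ kubectl get pods -n openshift-dns -o wide
```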

For more information about the configurable options for this CR and Operator, the following command can be used:

$ kubectl explain --api-version operator.openshift.io/v1 dnses.spec

Because the CDO manages the DNS deployment, it can be difficult to collect information from the running DNS servers, as the default logging configuration for OpenShift is limited.

To increase logging for the DNS deployment, the Operator must therefore be paused to allow custom changes to be made to the governing Corefile.

Setting up for Changes

To achieve this we must configure the ClusterVersionOperator (CVO) to exclude the CDO from its reconciliation loop; otherwise, when we disable the CDO it will be re-enabled on the next CVO reconcile.

Using the following command we can edit the ClusterVersion definition to ensure that the override defined below is added to the ClusterVersion CR.

$ kubectl edit clusterversion version

spec:
  overrides:
  - group: apps
    kind: Deployment
    name: dns-operator
    namespace: openshift-dns-operator
    unmanaged: true
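If an interactive edit is not desired, the same override can be applied non-interactively with a merge patch; the JSON below simply mirrors the YAML shown above:

```shell
$ kubectl patch clusterversion version --type merge \
    -p '{"spec":{"overrides":[{"group":"apps","kind":"Deployment","name":"dns-operator","namespace":"openshift-dns-operator","unmanaged":true}]}}'
```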

Now that the Cluster DNS Operator is set to unmanaged, we can stop it by scaling its Deployment to zero replicas:

$ kubectl scale -n openshift-dns-operator deployment/dns-operator --replicas=0
deployment.apps/dns-operator scaled

$ kubectl get pods -n openshift-dns-operator
No resources found in openshift-dns-operator namespace.

With the DNS Operator disabled, we can make arbitrary changes to the Corefile and restart the CoreDNS instance. Review the list of available CoreDNS Plugins to see what can be configured for troubleshooting CoreDNS.
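The Corefile itself is stored in the dns-default ConfigMap in the openshift-dns Namespace; with the Operator stopped, it can be edited directly:

```shell
# Edit the Corefile held in the dns-default ConfigMap
$ kubectl edit configmap/dns-default -n openshift-dns
```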

Log all Requests

We can add additional logging to the CoreDNS instances by using the log Plugin as below:

.:5353 {
    bufsize 512
    # Adding the log line below will result in all requests to the DNS Server
    # being logged to standard out
    log

    errors
    health {
        lameduck 20s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus 127.0.0.1:9153
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 900 {
        denial 9984 30
    }
    reload
}
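Once the change is in place, the query logs can be followed from the CoreDNS container of the dns-default DaemonSet, and a test lookup can be generated from inside the cluster to confirm that requests appear in the logs. A minimal sketch (the dns-test Pod name and busybox image are illustrative; any image that provides nslookup or dig will do):

```shell
# Follow the CoreDNS logs (the CoreDNS container in the Pod is named "dns")
$ oc logs -f -n openshift-dns ds/dns-default -c dns

# Issue a test query from a throwaway Pod to see it logged
$ oc run -it --rm dns-test --image=busybox:1.36 -- \
    nslookup kubernetes.default.svc.cluster.local
```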

Cleanup

NOTE: The DNS Pods must be restarted to pick up the ConfigMap changes:

$ oc delete pods --all -n openshift-dns

To revert any changes performed to the DNS Deployment or ConfigMaps, remove the override added earlier so that the CVO resumes managing the openshift-dns-operator, then restart the Pods:

$ kubectl edit clusterversion version

spec:
  overrides: []
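As with adding the override, removing it can also be done non-interactively by patching the overrides list back to empty:

```shell
$ kubectl patch clusterversion version --type merge \
    -p '{"spec":{"overrides":[]}}'
```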

NOTE:

  • All items discussed in this article have been tested on OpenShift 4.9, but should be similar for most OpenShift / OKD 4.x versions.
  • All uses of kubectl can be replaced with oc to achieve the same outcome.