Understanding and using the Kubernetes PodNodeSelector Admission Controller
Thanks to Timo Reimann, Bjoern Brundert and Tom Schwaller for their feedback and for helping to improve this post.
During a recent customer engagement, a discussion about Kubernetes NodeSelectors came up. There was some confusion about whether and how to use them for a multi-tenant cluster deployment. In the end, we decided to leverage the Kubernetes PodNodeSelector admission controller. Since not all implementation details were clear to me, I did some experiments and wanted to share my findings with you.
About NodeSelectors
Using NodeSelectors in Kubernetes is a common practice to influence scheduling decisions, which determine on which node (or group of nodes) a pod should run. NodeSelectors are key-value pairs that are matched against node labels. Common use cases include:
- Dedicate nodes to certain teams or customers (multi-tenancy)
- Distinguish between different types of nodes, for example “expensive” nodes with specialized hardware (GPUs, FPGAs) or ephemeral “spot” instances
- Define topologies for rack/zone/region awareness and high-availability
You can read more about NodeSelectors and other options in Assigning Pods to Nodes.
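As a quick refresher, here is a minimal pod manifest using a NodeSelector; the env=production key/value is only an example and assumes a node has been labelled accordingly:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  # Only schedule this pod onto nodes carrying the label env=production
  nodeSelector:
    env: production
  containers:
  - name: nginx
    image: nginx
```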
Given that these use cases are often requirements in enterprises, this topic comes up in almost every architecture workshop or design review I conduct with customers deploying Kubernetes on VMware vSphere. VMware vSphere VMs can be of any size and may have different quality of service (QoS) policies applied, for example resource guarantees or limits (CPU, memory, disk, network).
Just like with Kubernetes resource management, it is a service level agreement (SLA) and a contract the infrastructure provider agrees to with the consumer. Thus NodeSelectors play a critical role in ensuring that, for example, production workloads (Kubernetes deployments, or pods to be precise) run on VMs with QoS == “production”. Otherwise we would be at risk of breaking the contract. As a best practice, NodeSelectors are typically set during CI/CD pipeline stages without human intervention.
NOTE
You might be wondering why I am not discussing related concepts, like node (anti-) affinity or taints and tolerations. Please see the section “Roadmap and alternative Approaches” at the end of this post.
Admission Controllers
Quoting from the Kubernetes documentation:
An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized. The controllers […] are compiled into the kube-apiserver binary, and may only be configured by the cluster administrator. […] If any of the controllers in either phase reject the request, the entire request is rejected immediately and an error is returned to the end-user.
There are many admission controllers in the Kubernetes core, for example LimitRanger or PodPreset. A complete list is provided here.
The Use Case for PodNodeSelector
The PodNodeSelector admission controller has been in the Kubernetes source for a while, based on an upstream contribution from Red Hat (OpenShift’s project node selectors). Quoting again from the Kubernetes documentation:
This admission controller defaults and limits what node selectors may be used within a namespace by reading a namespace annotation and a global configuration.
OK, so why would I use that one instead of simply specifying NodeSelectors in the pod manifest? Well, using this admission controller has some advantages:
- As the name implies, it enforces node selectors at admission time before creating the pod object
- This puts less pressure on the master components, for example the scheduler; if there is no matching namespace or a conflict, the pod object will not be created
- You can define a default node selector for namespaces that have not specified one
- Ultimately, it gives the cluster administrator more control over the node selection process (hard enforcement)
The last one is very important if you have large distributed teams or different departments within the organization. I have seen customer environments where there was no communication between infrastructure operations, the Kubernetes platform engineers and the development teams. The lack of alignment led to performance issues and, sometimes even worse, production outages. Simply speaking, every team had a different understanding and definition of “production”.
Translated into an enterprise deployment, the following (highly abstracted) picture tries to make this clear. Here, a kubectl user is only allowed to deploy nginx pods into the namespace “development”, enforced by Kubernetes role-based access control (RBAC). However, if she doesn’t specify a NodeSelector, this pod could end up being scheduled on the production VMs. The PodNodeSelector admission controller automatically injects a NodeSelector, based on a default policy.
On the other side, the CI/CD pipeline has a workflow defined to deploy mission-critical nginx proxies into the namespace “production”. Those should always land on a production VM, because those VMs have a higher QoS policy applied by the VMware vSphere cluster administrator (remember the “contract” we spoke about earlier?).
NOTE
I am working on a follow-up article describing distributed resource management and QoS in the enterprise. A fascinating topic if you ask me :)
But using admission controllers, in their current implementation, can also have downsides, most notably:
- They need to be compiled into the Kubernetes API server
- A change in configuration requires a restart of the API server
- If you use a managed Kubernetes environment, like Google Kubernetes Engine (GKE), configuration options might be limited
This is why Dynamic Admission Control has been added to Kubernetes recently.
Back to our admission controller. Now that we understand the pros and cons, let’s deploy it and see it in action.
Setting Things up
My demo environment is based on a three-node kubeadm deployment, running in VMware Fusion:
- Kubernetes version: 1.9.2
- 1x “master” VM: Just system pods allowed, i.e. core Kubernetes services
- 1x “production” VM: Only run pods from namespace “production”
- 1x “development” VM: Run any pod from any namespace, except for “production”
- Kubernetes PodNodeSelector admission controller enabled in the API server
Terms used in the Examples
To better understand the examples, let me first explain some terminology used by the PodNodeSelector admission controller:
- clusterDefaultNodeSelector (global) - Unless a whitelist has been specified, add this NodeSelector to all pods (applies to all namespaces without a whitelist specified)
- Whitelist (per namespace) - Only allow this NodeSelector (requires a matching namespace annotation)
- Namespace annotation - Key/value metadata attached to namespaces; the differences between annotations and labels are described here
Preparing the API Server
Assuming your cluster has been deployed with kubeadm, the kube-apiserver.yaml file is the configuration manifest for the Kubernetes API server. It is located in /etc/kubernetes/manifests. Add the PodNodeSelector admission controller to the --admission-control= flag.
NOTE
Strictly speaking, the plugin order matters. We will ignore this in our demo example. You can read more about it in Using Admission Controllers.
Also add the --admission-control-config-file flag as well as the volumeMounts and volumes sections as depicted below. They relate to the default selectors and whitelist, which we will create in a second.
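The relevant part of the manifest could look roughly like this (the admission controller list is only an example; keep whatever plugins your installation already enables and append PodNodeSelector):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt, unrelated flags omitted)
spec:
  containers:
  - command:
    - kube-apiserver
    # Append PodNodeSelector to the existing list of admission controllers
    - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,PodNodeSelector
    # Points to the configuration file created in the next step
    - --admission-control-config-file=/etc/kubernetes/adm-control/admission-control.yaml
    volumeMounts:
    - mountPath: /etc/kubernetes/adm-control
      name: adm-control
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/adm-control
    name: adm-control
```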
Create the file admission-control.yaml in /etc/kubernetes/adm-control/.
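Based on the namespaces and labels used throughout this post, a minimal version of the file could look like this (podNodeSelectorPluginConfig is the configuration key the PodNodeSelector admission controller reads):

```yaml
# /etc/kubernetes/adm-control/admission-control.yaml
podNodeSelectorPluginConfig:
  # Injected into pods in namespaces that have no whitelist entry and no annotation
  clusterDefaultNodeSelector: env=development
  # Per-namespace whitelists
  production: env=production
  noannotation: env=production
```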
This file defines the default NodeSelector for the cluster, as well as the whitelist for each namespace. Note the following:
- Every pod created in the “production” namespace will be injected with (i.e. get assigned) the NodeSelector “env=production”
- Every pod in the “development” namespace will inherit the clusterDefaultNodeSelector, in this case “env=development”
- Effectively, you do not have to mention namespaces with an empty whitelist in this file
- The “noannotation” namespace is to demonstrate what happens when there is no matching annotation in the namespace properties but a whitelist has been specified
NOTE
For whitelists to work, every namespace has to have a matching annotation (this is not a label!). “noannotation” shows what happens when there is no (matching) annotation.
Prepare the Nodes (kubelets)
Apply labels to the nodes, matching the whitelist NodeSelectors above.
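The node names below are hypothetical; use whatever kubectl get nodes reports in your environment:

```bash
# Label the worker VMs to match the selectors used in admission-control.yaml
kubectl label node k8s-prod env=production
kubectl label node k8s-dev env=development

# Verify the labels
kubectl get nodes --show-labels
```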
The cluster nodes now carry the labels that the whitelist entries and the clusterDefaultNodeSelector refer to.
Prepare the Namespaces
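A minimal sketch of the namespace setup; note that “production” gets an annotation (not a label), which is what the PodNodeSelector admission controller matches against the whitelist:

```bash
kubectl create namespace production
kubectl create namespace development
kubectl create namespace noannotation

# Only "production" gets the node-selector annotation
kubectl annotate namespace production scheduler.alpha.kubernetes.io/node-selector=env=production
```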
This is what namespace “production” looks like now:
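Trimmed to the relevant fields, kubectl get namespace production -o yaml should now show roughly the following:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: env=production
  name: production
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
```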
If you also wonder about the alpha in the annotation key, there is an issue discussing it.
Ready to go!
Deploy some Pods
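One way to create the test workloads is a simple nginx deployment per namespace (on Kubernetes 1.9, kubectl run creates a Deployment by default):

```bash
kubectl run nginx --image=nginx --namespace=production
kubectl run nginx --image=nginx --namespace=development
kubectl run nginx --image=nginx --namespace=default
kubectl run nginx --image=nginx --namespace=noannotation
```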
To see what Kubernetes created for us, and which node each pod landed on, check kubectl get pods --all-namespaces -o wide.
First, let’s examine each pod in “production” and “development” (the pod in “default” is equal to the “development” case):
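One way to inspect the NodeSelector that ended up in each pod spec (kubectl describe pod shows the same information):

```bash
kubectl get pods -n production  -o jsonpath='{.items[*].spec.nodeSelector}{"\n"}'
kubectl get pods -n development -o jsonpath='{.items[*].spec.nodeSelector}{"\n"}'
```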
But where’s the nginx pod from the “noannotation” deployment? Let’s ask Kubernetes:
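A quick way to investigate: the Deployment and its ReplicaSet exist, but the ReplicaSet’s events show that pod creation was rejected by the admission controller:

```bash
kubectl get pods -n noannotation
kubectl get deployment,replicaset -n noannotation
# The events in the describe output contain the rejection reason
kubectl describe replicaset -n noannotation
```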
As you can see, no pod is being created, due to the admission controller. We intentionally triggered this since we did not specify a namespace annotation matching the whitelist (as we did with the “production” namespace).
Deploy a Pod with a conflicting NodeSelector
Now let’s try to create a pod in the “production” namespace, but specify a conflicting NodeSelector in the pod manifest. This could happen for various reasons: a typo, a greedy user, not cleaning up pod manifests after changing labels, etc.
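A hypothetical manifest for this test; the nodeSelector deliberately conflicts with the whitelist of the “production” namespace:

```yaml
# nginx2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx2
  namespace: production
spec:
  nodeSelector:
    env: development   # conflicts with the whitelisted env=production
  containers:
  - name: nginx2
    image: nginx
```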
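Creating it is rejected at admission time, before any pod object is persisted (the exact error wording depends on the Kubernetes version):

```bash
kubectl create -f nginx2.yaml
# The API server rejects the request: the pod's node label selector
# conflicts with the node selector enforced for the "production" namespace
```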
This is nice. The admission controller checks for conflicts. And since the “nginx2” pod is supposed to be deployed in the “production” namespace, we only allow label selectors listed in the whitelist section of admission-control.yaml.
Some suggestions
To keep things simple and understandable, I usually follow these best practices when using this admission controller:
- Come up with a good node labelling scheme (sounds easy, but can be challenging at times)
  - Kubelet’s --node-labels argument can be very useful (warning: this is an alpha feature)
- Use a clusterDefaultNodeSelector defaulting to nodes != “production”, which has the following effect:
  - Namespaces with no annotations and no whitelists specified will inherit this setting
  - Additional NodeSelectors are allowed in the pod manifest
- For mission-critical namespaces, specify annotations and whitelists
  - Pods deployed therein will have NodeSelectors strictly checked and enforced, that is, no pod manifest conflicts/drift are allowed
- Use role-based access controls to secure your namespaces
Roadmap and alternative Approaches
You might be wondering why I am not touching on node (anti-) affinity, external admission webhooks or taints and tolerations. The latter would be appropriate as well, but one would have to write a custom admission controller.
node (anti-) affinity (currently in beta) is supposed to become the successor of NodeSelectors, with richer query logic. When it graduates from beta to stable, NodeSelectors will eventually be deprecated.
For external admission webhooks, as part of dynamic admission control, it’s still early days, and for some simple use cases they might be overkill. But this is just my opinion at the time of writing this article. Kubernetes is moving fast, and the amazing community is quickly resolving gaps and issues.
So why would you use the PodNodeSelector admission controller today? Here’s my point of view:
- So far, NodeSelectors and, as such, the corresponding admission controller have not been marked deprecated
- The implementation and configuration is straightforward, robust and easy to understand
- At the time of writing, there is no admission controller for node (anti-) affinity, so you lose the advantages of an admission controller described in this article
  - I am not the only one interested in fixing this: https://github.com/kubernetes/kubernetes/issues/58198
- Of course, you could write your own admission controller/webhook
Thanks for taking the time to read this post. I hope you found it interesting and it at least answered some questions. As always, feel free to share it on social media and reach out to me on Twitter for any feedback, comments or corrections.