# Node Operation Controller

This is a Kubernetes controller for automated Node operations. In general, a Node operation that affects running Pods requires the following steps:

- Make the Node unschedulable.
- Evict the Pods running on the Node and wait for all of them to be evicted.
- Perform the operation.
- Make the Node schedulable again.

The Node operation controller automates these steps. In addition, this controller:

- watches NodeConditions and performs an arbitrary operation in response, and
- keeps count of Nodes that are unavailable due to operations.
## How it works

### NodeOperation and NodeDisruptionBudget
1. When a NodeOperation resource is created, the controller starts processing it.
2. Confirm that the NodeOperation does not violate any NodeDisruptionBudgets.
   - If it does, wait for other NodeOperations to finish.
3. Taint the target Node specified in the NodeOperation.
   - The taint is `nodeops.k8s.preferred.jp/operating=:NoSchedule`.
4. Evict all Pods running on the Node.
   - By default, this uses the Pod eviction API, so PodDisruptionBudgets are respected.
   - This behavior can be configured by the `evictionStrategy` option of NodeOperation.
5. After the eviction, run the Job configured in the NodeOperation.
   - The Pod created by the Job has the `nodeops.k8s.preferred.jp/nodename` annotation, which indicates the target Node.
6. Wait for the Job to reach the Completed or Failed phase.
7. Remove the taint from the Node.
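To illustrate the tainting step above: while a NodeOperation is in progress, the target Node carries the controller's taint. A Node observed mid-operation would look roughly like the sketch below (the node name is hypothetical; the taint key and effect come from the steps above):

```yaml
# Sketch of a Node while a NodeOperation is running on it.
apiVersion: v1
kind: Node
metadata:
  name: node-1   # hypothetical
spec:
  taints:
  - key: nodeops.k8s.preferred.jp/operating
    effect: NoSchedule   # empty value, matching "operating=:NoSchedule"
```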

### NodeRemediation and NodeRemediationTemplate

Most operation teams have their own secret sauce for daily operations: typical node failures can be cured by common recipes shared within the team. `NodeRemediation`, `NodeRemediationTemplate` and `NodeOperationTemplate` let us automate such common operations for known node issues.

`NodeOperationTemplate` represents a template of a common node operation.

`NodeRemediation` defines:

- the target Node to apply the remediation to,
- a known failure, expressed as Node conditions, and
- the corresponding `nodeOperationTemplate` to fix the failure.

`NodeRemediationTemplate` defines:

- the target Nodes to apply the remediation to, selected by `nodeSelector`, and
- a template of `NodeRemediation`.

The Node operation controller watches Nodes, and when it detects a failure that matches some `NodeRemediation`, it automatically creates a `NodeOperation` from the specified `NodeOperationTemplate`.

## Custom Resources

### NodeOperation
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeOperation
metadata:
  name: example
spec:
  nodeName: "<operation target node>"
  jobTemplate:
    metadata:
      namespace: default
    spec: # batchv1.JobSpec
      template:
        spec:
          containers:
          - name: operation
            image: busybox
            command: ["sh", "-c", "echo Do some operation for $TARGET_NODE && sleep 60 && echo Done"]
            env:
            - name: TARGET_NODE
              valueFrom:
                fieldRef:
                  fieldPath: "metadata.annotations['nodeops.k8s.preferred.jp/nodename']"
          restartPolicy: Never
  evictionStrategy: Evict # optional
  nodeDisruptionBudgetSelector: {} # optional
  skipWaitingForEviction: false # optional
```
#### evictionStrategy

This controller has several ways to evict Pods:

- `evictionStrategy: Evict`: tries to evict Pods via the Pod eviction API; PodDisruptionBudgets are respected.
- `evictionStrategy: Delete`: evicts Pods by deleting them.
- `evictionStrategy: ForceDelete`: evicts Pods by deleting them forcibly.
- `evictionStrategy: None`: does not evict Pods; it just waits for all Pods to finish.
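For example, an urgent operation that must proceed even when Pods cannot be evicted gracefully might use `ForceDelete`. This is a sketch; the resource name is hypothetical and only the relevant fields are shown:

```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeOperation
metadata:
  name: urgent-operation   # hypothetical
spec:
  nodeName: "<operation target node>"
  evictionStrategy: ForceDelete   # delete Pods forcibly instead of using the eviction API
  jobTemplate:
    metadata:
      namespace: default
    spec:
      template:
        spec:
          containers:
          - name: operation
            image: busybox
            command: ["echo", "urgent maintenance"]
          restartPolicy: Never
```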
#### nodeDisruptionBudgetSelector

By default, a NodeOperation respects all NodeDisruptionBudgets (NDBs), but in some cases some NDBs need to be ignored (e.g. for urgent operations).

If `nodeDisruptionBudgetSelector` is set, only NDBs whose labels match the selector are respected.
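For illustration, an urgent NodeOperation could be made to respect only a critical NDB. The sketch below assumes `nodeDisruptionBudgetSelector` is a standard Kubernetes label selector (`matchLabels`); all names and labels are hypothetical:

```yaml
# An NDB labeled so that it can be singled out by selectors
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeDisruptionBudget
metadata:
  name: critical-budget
  labels:
    severity: critical
spec:
  selector:
    role: gateway
  minAvailable: 2
---
# This NodeOperation respects only NDBs labeled severity=critical;
# any other NDBs are ignored.
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeOperation
metadata:
  name: urgent-op
spec:
  nodeName: "<operation target node>"
  nodeDisruptionBudgetSelector:
    matchLabels:
      severity: critical
  jobTemplate:
    metadata:
      namespace: default
    spec:
      template:
        spec:
          containers:
          - name: operation
            image: busybox
            command: ["echo", "Do some operation here"]
          restartPolicy: Never
```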
#### skipWaitingForEviction

By default, a NodeOperation waits for all Pods to be drained by the eviction.

If `skipWaitingForEviction` is true, a NodeOperation does not wait for the eviction to finish; that is, it ignores Pods that have not been drained yet.
### NodeDisruptionBudget

```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeDisruptionBudget
metadata:
  name: example
spec:
  selector: # nodeSelector for Nodes that this NodeDisruptionBudget affects
    nodeLabelKey: nodeLabelValue
  maxUnavailable: 1 # optional
  minAvailable: 1 # optional
  taintTargets: [] # optional
```
#### maxUnavailable and minAvailable

- `minAvailable`: the minimum number of available Nodes
- `maxUnavailable`: the maximum number of unavailable Nodes
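As a worked example, suppose five Nodes carry the label below (the label and the numbers are hypothetical). With `minAvailable: 4`, at most one of these Nodes may be unavailable at a time, so concurrent NodeOperations on this group are effectively serialized:

```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeDisruptionBudget
metadata:
  name: keep-four-available   # hypothetical
spec:
  selector:
    pool: workers   # assume 5 Nodes have this label
  minAvailable: 4   # 5 - 4 = at most 1 Node under operation at once
```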
#### taintTargets

By default, this controller treats Nodes with a specific taint as "unavailable". The taint is `nodeops.k8s.preferred.jp/operating=:NoSchedule`, and it is added to Nodes while this controller is processing NodeOperations.

In addition to the default taint, Nodes whose taints match `taintTargets` are also considered "unavailable".
```yaml
taintTargets:
- key: 'k1'
  operator: 'Equal'
  value: 'v1'
  effect: 'NoSchedule'
```
For instance, with the above `taintTargets`, Nodes with the `k1=v1:NoSchedule` taint are "unavailable".
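A taint like this might be set by another component (for example, a node problem detector) or by an administrator. Such a Node would look roughly like the sketch below (the node name is hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node1   # hypothetical
spec:
  taints:
  - key: k1
    value: v1
    effect: NoSchedule   # matches the taintTargets entry above, so the Node counts as "unavailable"
```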
### NodeOperationTemplate and NodeRemediation

A NodeRemediation watches the conditions of a Node, and it creates a NodeOperation from a NodeOperationTemplate to remediate the condition.
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeOperationTemplate
metadata:
  name: optemplate1
spec:
  template:
    metadata: {}
    spec: # NodeOperationSpec
      job:
        metadata:
          namespace: default
        spec: # batchv1.JobSpec
          template:
            spec:
              containers:
              - name: operation
                image: busybox
                command: ["echo", "Do some operation here"]
              restartPolicy: Never
```
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeRemediation
metadata:
  name: remediation1
spec:
  nodeName: node1
  nodeOperationTemplateName: 'optemplate1'
  rule:
    conditions:
    - type: PIDPressure
      status: "True"
    - type: OtherCondition
      status: "Unknown"
```
### NodeRemediationTemplate

A NodeRemediationTemplate creates a NodeRemediation for each Node selected by `nodeSelector`.
```yaml
apiVersion: nodeops.k8s.preferred.jp/v1alpha1
kind: NodeRemediationTemplate
metadata:
  name: remediationtemplate1
spec:
  nodeSelector:
    'kubernetes.io/os': 'linux'
  template:
    spec:
      nodeOperationTemplateName: 'optemplate1'
      rule:
        conditions:
        - type: PIDPressure
          status: "True"
        - type: OtherCondition
          status: "Unknown"
```
## How to Release

The release process is fully automated by tagpr. To release, just merge the latest release PR.