What is a Kubernetes estate audit?

A Kubernetes estate audit is a structured review of an existing Kubernetes platform that you did not build, covering cluster topology, operational maturity, security posture, cost structure, and the team's ability to operate it. It is usually run when a new engineering leader takes over, after an acquisition, or before a major investment decision.

What are the biggest risks in an inherited Kubernetes platform?

The biggest risks are usually invisible from the outside: undocumented exceptions baked into long-lived clusters, single points of operational knowledge in one or two people, security misconfigurations that have been normalised, and recurring costs from workloads nobody owns. These compound and are expensive to unwind.

How long does a Kubernetes estate review take?

A focused estate review typically takes two to four weeks depending on the size of the estate. The goal is not to document everything; it is to produce a defensible view of risk, cost, and operability so that the new leader can make decisions with confidence.

When should a CTO commission an external Kubernetes review?

External reviews are most valuable in three situations: shortly after a new engineering leader takes over an existing platform, immediately after acquiring or being acquired by another organisation, and before making a major investment decision (migration, consolidation, replatforming) where the cost of being wrong is significant.

What to Check Before You Trust an Inherited Kubernetes Estate

On this page

Why Inherited Platforms Are Different
What an Honest Inherited Estate Review Looks At
Red Flags Worth Slowing Down For
What You Are Producing, And Why It Matters
The Takeaway

You have just taken over a Kubernetes platform you did not build.

Maybe you are a new CTO. Maybe the company was acquired. Maybe the engineering leader who built it left, and the estate has been quietly running without an owner for six months.

Either way, you are now responsible for it. And before you sign off on its risk, commit to its roadmap, or quote its costs to the board, you need to know what you have actually inherited.

Most inherited platforms look fine from the outside. The dashboards are green. The deployments work. Someone, somewhere, has been keeping it running.

That is not enough information to trust it.

Why Inherited Platforms Are Different

A Kubernetes platform you built yourself is an artefact you understand: you know why the choices were made, which compromises were intentional, and where the bodies are buried. You can reason about its limits.

A platform you inherited is the opposite. Every configuration is the result of a decision you weren’t part of. Every exception was made for a reason you don’t know. Every workaround that became permanent did so because someone, at some point, ran out of time to fix it properly.

Inherited platforms have three properties that make them risky:

The original decision context is lost. Why is the production cluster running an old Kubernetes version? Possibly because the team is behind on upgrades. Possibly because a critical workload won’t certify against newer versions. You cannot tell from the cluster.
Tribal knowledge is concentrated in a few people. Usually the engineers who built the platform have moved on, leaving one or two who remember why things were done a certain way. When they leave, the institutional understanding goes with them.
Cost and risk are not where they appear. The expensive parts of an inherited estate are usually not the parts the previous team flagged. They are the parts that were normalised so thoroughly that nobody thinks of them as problems.

This is why “the platform works” is not a defensible answer. The right question is whether you understand it well enough to commit to its future.

What an Honest Inherited Estate Review Looks At

There are six areas worth examining before you trust an inherited platform. None of them are about Kubernetes specifically. They are about whether the operational reality matches the apparent one.

1. Cluster topology and justification

For each cluster in the estate, you need to answer two questions: what is it for, and why is it a separate cluster?

If the answer to the second question is unclear, vague, or “historical reasons”, you have inherited cluster sprawl. This is one of the most common findings in inherited estates and one of the most expensive to unwind, because every cluster carries fixed operational overhead regardless of what runs on it.

A useful test: if half the clusters in the estate were consolidated into the other half, what would break? If the answer is “nothing technical, but the teams won’t agree”, the cluster boundary is political, not architectural. That is fixable, but you need to know it exists.

2. Upgrade posture

How current is each cluster, and how current does it stay?

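Kubernetes ships roughly three minor versions a year, and the upstream project maintains release branches for only the three most recent. Every version you fall behind is operational debt: increasing CVE exposure, narrowing third-party compatibility, and, on managed services, eventually moving the estate into extended support pricing.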

Inherited estates often have clusters that have been “about to be upgraded” for two or three releases. The reason is rarely technical. It is usually that the upgrade requires coordination with application teams who would rather not, and nobody owns forcing the issue.

If the estate has any cluster more than two minor versions behind, that is a signal worth investigating. The reason it is behind tells you more about the operating model than about the cluster.
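The "more than two minor versions behind" check is easy to automate once you have a version string per cluster. The sketch below is a minimal illustration, not a definitive tool: the cluster names, the version values, and the `latest` release are all hypothetical, and it assumes every cluster is on the same major version.

```python
import re


def minor_version(version: str) -> int:
    """Extract the minor number from a version string such as 'v1.27.3'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return int(match.group(2))


def clusters_behind(cluster_versions: dict, latest: str, max_lag: int = 2) -> dict:
    """Return {cluster: lag} for clusters more than max_lag minor versions behind."""
    latest_minor = minor_version(latest)
    lags = {
        name: latest_minor - minor_version(version)
        for name, version in cluster_versions.items()
    }
    return {name: lag for name, lag in lags.items() if lag > max_lag}


# Hypothetical estate: names and versions are made up for illustration.
estate = {"prod-eu": "v1.27.9", "prod-us": "v1.30.2", "staging": "v1.29.4"}
print(clusters_behind(estate, latest="v1.31.0"))  # {'prod-eu': 4}
```

The output is only the starting point. The number tells you which clusters to ask about; the reason each one fell behind still has to come from the people operating it.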

3. Workload ownership

For every workload running in the estate, who owns it?

In healthy platforms, every workload has a named owning team, a runbook, and a clear escalation path for the platform team. In inherited platforms, this is rarely true. There are usually:

Workloads where the owning team has dissolved or restructured.
Proof-of-concepts that drifted into production without anyone formally accepting them.
Workloads that the platform team is operating because nobody else will.
Workloads that everyone assumes someone else owns.

Each of these is a recurring cost and a recurring risk. They are expensive to find but cheap to fix once found, which is why this is one of the highest-leverage things to audit early.
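One way to get a first list of candidates is to look for workloads that carry no ownership metadata at all. The sketch below assumes the estate records ownership in a label called `owner`, which is a hypothetical convention and may not match yours. It reads the JSON produced by `kubectl get deployments --all-namespaces -o json` and does not talk to a cluster itself.

```python
import json

OWNER_LABEL = "owner"  # hypothetical label key; substitute your own convention


def unowned_workloads(kubectl_json: str, owner_label: str = OWNER_LABEL) -> list:
    """List 'namespace/name' for every item that has no non-empty owner label."""
    items = json.loads(kubectl_json).get("items", [])
    unowned = []
    for item in items:
        meta = item.get("metadata", {})
        labels = meta.get("labels") or {}
        if not labels.get(owner_label):
            namespace = meta.get("namespace", "default")
            unowned.append(f"{namespace}/{meta.get('name', '?')}")
    return sorted(unowned)


# Made-up sample in the shape kubectl emits for a list of objects.
sample = json.dumps({"items": [
    {"metadata": {"name": "checkout", "namespace": "shop",
                  "labels": {"owner": "payments"}}},
    {"metadata": {"name": "legacy-poc", "namespace": "labs"}},
]})
print(unowned_workloads(sample))  # ['labs/legacy-poc']
```

A label only proves that someone once claimed the workload. Checking whether the named team still exists is a manual step this sketch does not attempt.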

4. Security posture, as it actually is

Every Kubernetes platform has a security model on paper. The question is whether the platform actually behaves the way the document says it does.

A useful starting set:

Are RBAC permissions reviewed, or have they accreted over years of ad hoc requests?
Are admission controllers actually enforcing policy, or have they been silently bypassed for specific workloads?
Are secrets managed through a defined system, or are some still living in ConfigMaps and Helm values from years ago?
Are nodes hardened against common compromise paths, or is the baseline still the original installer defaults?

Inherited estates almost always show drift between the documented security model and the operational one. The question is how much, and where.
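The secrets question in particular lends itself to a quick, crude check. The sketch below scans ConfigMap key names for words that suggest credentials, using the JSON from `kubectl get configmaps --all-namespaces -o json`. The pattern list is an assumption and deliberately simple: it will miss secrets stored under innocuous keys and will flag some harmless ones, so treat hits as leads rather than findings.

```python
import json
import re

# Assumed heuristic: key names that commonly indicate credentials.
SUSPICIOUS_KEY = re.compile(
    r"password|passwd|secret|token|api[_-]?key|private[_-]?key", re.IGNORECASE
)


def suspicious_configmap_keys(kubectl_json: str) -> list:
    """Return sorted (namespace/name, key) pairs whose key name looks sensitive."""
    findings = []
    for item in json.loads(kubectl_json).get("items", []):
        meta = item.get("metadata", {})
        location = f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}"
        for key in item.get("data") or {}:
            if SUSPICIOUS_KEY.search(key):
                findings.append((location, key))
    return sorted(findings)


# Made-up sample ConfigMap list for illustration.
sample = json.dumps({"items": [
    {"metadata": {"name": "api-config", "namespace": "shop"},
     "data": {"DB_PASSWORD": "changeme", "LOG_LEVEL": "info"}},
]})
print(suspicious_configmap_keys(sample))  # [('shop/api-config', 'DB_PASSWORD')]
```

The function reports key names only and never prints values, so its output can be shared in a findings document without leaking anything further.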

5. Cost structure and ownership

Most inherited estates have at least one source of cost that nobody can explain. Sometimes it is a development environment that was never decommissioned. Sometimes it is overprovisioned production capacity from a long-past traffic peak. Sometimes it is an observability stack ingesting telemetry from workloads that were retired.

The early-stage question is not “how do we reduce cost?”. It is “do we know what we are paying for?”. If the cost cannot be attributed cleanly to teams, workloads, or environments, the estate has a visibility problem that will block every future optimisation.
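A simple way to put a number on "do we know what we are paying for?" is the share of spend that carries no owner. The sketch below assumes billing line items have already been exported as dictionaries with a `cost` field and an optional `tags` mapping. Both field names and the `team` tag are hypothetical, since every billing export is shaped differently.

```python
from collections import defaultdict

UNATTRIBUTED = "UNATTRIBUTED"


def spend_by_owner(line_items: list, owner_tag: str = "team") -> dict:
    """Sum cost per owner; items without the owner tag go to UNATTRIBUTED."""
    totals = defaultdict(float)
    for item in line_items:
        owner = (item.get("tags") or {}).get(owner_tag) or UNATTRIBUTED
        totals[owner] += item["cost"]
    return dict(totals)


def unattributed_share(line_items: list, owner_tag: str = "team") -> float:
    """Fraction of total spend that cannot be attributed to any owner."""
    totals = spend_by_owner(line_items, owner_tag)
    total = sum(totals.values())
    return totals.get(UNATTRIBUTED, 0.0) / total if total else 0.0


# Made-up monthly line items for illustration.
items = [
    {"cost": 600.0, "tags": {"team": "payments"}},
    {"cost": 300.0, "tags": {}},
    {"cost": 100.0, "tags": {"team": "search"}},
]
print(spend_by_owner(items))
print(f"{unattributed_share(items):.0%} of spend has no owner")
```

Tracking this one ratio over time is often more useful early on than any optimisation target, because it measures the visibility problem directly.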

6. The operating team

The most important part of the audit is also the most often skipped: what is the actual operational state of the team running the platform?

Is on-call sustainable, or is it concentrated on two people?
Is the team primarily building, or primarily responding?
Are critical pieces of operational knowledge documented, or held in one person’s head?
Has the team been given a clear mandate and the authority to enforce it, or are they negotiating every change individually?

A healthy estate operated by an unhealthy team will become an unhealthy estate within six to twelve months. This is the part of the audit most likely to produce uncomfortable findings, and the part most likely to determine the trajectory.

Red Flags Worth Slowing Down For

Some findings should change the conversation. If your inherited estate review surfaces any of these, treat them as a signal to slow down and look harder before committing to a roadmap:

Single points of operational knowledge. One engineer who knows how the platform actually works. Their departure is a near-term business risk.
Clusters that nobody is willing to upgrade. The reason is usually that the upgrade exposes a workload incompatibility that has been deferred. The cost of finding out unexpectedly is much higher than the cost of investigating now.
A platform team operating beyond capacity. Not “stretched”, actually beyond capacity. This is a precursor to attrition, missed maintenance, and the kinds of incidents that show up at the worst possible time.
Security exceptions that have become permanent. A bypass that was granted once for a deadline and never revoked. These accumulate, and the audit trail is rarely complete.
Costs that nobody can attribute. Spend that exists, that someone is paying for, but that nobody can explain.

Each of these is recoverable, but each one represents work that should be done before, not after, the platform is treated as load-bearing for new commitments.

What You Are Producing, And Why It Matters

The output of an inherited estate review is not a list of problems. It is a defensible view of three things:

What you have inherited, in operational terms a non-technical executive can understand.
What it would take to operate it confidently, including the work that has been deferred.
What needs to be true before the platform can support whatever you are trying to do next.

That third one is what the review is really for. Most inherited platforms are committed to a roadmap before they have been assessed. The roadmap then runs into reality, and the platform team is blamed for missing dates that were never realistic given the state of the estate.

An honest assessment up front prevents that. It also gives the new engineering leader a defensible position: a clear-eyed view of what they have taken on, what it will cost, and what they are choosing to fix first.

The Takeaway

If you have just inherited a Kubernetes estate, the most expensive thing you can do is treat the green dashboards as evidence that the platform is sound.

Healthy-looking platforms routinely hide undocumented exceptions, concentrated operational knowledge, and costs nobody can explain. None of these are visible from the outside. All of them compound over time, and the cost of finding them late is much higher than the cost of finding them early.

The first three months of a new engineering leadership tenure, or the first three months after an acquisition, are the cheapest possible time to commission an honest review. After that, the new team has implicitly endorsed the state of the platform, and changing that view becomes politically expensive.

If you are in that window now, we can help you take a proper look.
