<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Cloud Native Maturity Model – The Five Aspects of Cloud Native Maturity</title><link>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/</link><description>Recent content in The Five Aspects of Cloud Native Maturity on Cloud Native Maturity Model</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/index.xml" rel="self" type="application/rss+xml"/><item><title>Aspects: Business Outcomes</title><link>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/business_outcomes/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/business_outcomes/</guid><description>
&lt;h1 id="cloud-native-maturity-model-business-outcomes">Cloud Native Maturity Model Business Outcomes&lt;/h1>
&lt;p>Deciding to adopt a cloud native approach for your application or services is usually driven by business reasons. Whether your business wants to scale to millions of users and needs scalable infrastructure to support that growth, or your product team needs to ship features to market faster, cloud native can help.&lt;/p>
&lt;p>Just as you will have addressed your journey across people, process, policy and technology in support of cloud native, you also need the ability to clearly communicate the value your business should expect. The CXO board and/or business leadership should all receive that communication and understand the value. This requires you to align your business goals with how cloud native will help you achieve them. Some examples may include:&lt;/p>
&lt;ul>
&lt;li>Scale to 1 million users: Provide flexible, scalable infrastructure that grows with the number of users at any given time, equipped with fast failover in the event of a problem.&lt;/li>
&lt;li>Deliver exceptional customer experience: Ensure the app is reliable so as not to frustrate users.&lt;/li>
&lt;li>Get features to market faster: Enable a microservices approach to building apps. Smaller teams are more agile because each team has a focused function. APIs minimize the amount of cross-team communication required to build and deploy.&lt;/li>
&lt;/ul>
&lt;p>It is important to accept that your cloud native maturity is not a linear evolution and there will be setbacks. Remember to communicate this clearly. For example, you may have put the infrastructure in place to ensure a reliable application for your customers, but due to a misconfiguration in code, the outcome is delayed. As with all development, communicate the pros and cons to stakeholders so they understand before passing judgment on cloud native.&lt;/p>
&lt;h2 id="level-1-build">Level 1: Build&lt;/h2>
&lt;p>Level 1 of the Cloud Native Maturity Model is where your team has a baseline implementation in place and you are in pre-production. Here you will have completed a successful POC. Based on the POC, you should have initial findings on how cloud native will help improve your app. In a dev environment, you could, for example, have seen that:&lt;/p>
&lt;ul>
&lt;li>An app is using fewer resources (cost savings / more efficient use)&lt;/li>
&lt;li>A new feature shipped faster (faster time to market and thus increased revenue)&lt;/li>
&lt;li>There was no downtime (improved reliability for customers)&lt;/li>
&lt;li>Improved business continuity thanks to resilient cloud architectures&lt;/li>
&lt;/ul>
&lt;p>These are just examples, not guarantees; results will vary by environment.&lt;/p>
&lt;p>In this phase, you will determine how you’ll measure the success of your cloud native journey (your initial KPIs) and, just as important, how you will demonstrate it to stakeholders. This is a major outcome of Level 1, as the entire success of the journey should be mapped to this measurement. Remember that results won&amp;rsquo;t be immediate on day 1. Some quantitative and qualitative example KPIs may include:&lt;/p>
&lt;ul>
&lt;li>Reduced spend on app infrastructure by 25% by optimizing for cost&lt;/li>
&lt;li>Dev cost lowered by 10%&lt;/li>
&lt;li>Reduced team focus on app infrastructure by 15% by automating as much as possible&lt;/li>
&lt;li>Increased security for the application by automating CVE identification in containers&lt;/li>
&lt;li>Improve compliance as you can restrict and track access to the application; demonstrate compliance with SOC 2&lt;/li>
&lt;li>Accelerated development life cycles as you implement CI/CD pipelines shipping 10% more features per quarter&lt;/li>
&lt;li>Migration plan - this will vary depending on your organization, but you should have a migration plan in place. Whether that’s to migrate one application first, or several, you should have this established.&lt;/li>
&lt;li>Improved customer experience measured by increased performance scores&lt;/li>
&lt;li>Elimination of information silos: departments are no longer isolated; a single, integrated ecosystem is in place.&lt;/li>
&lt;li>Alignment of business and IT goals: everyone is involved and aware, so that resources are better addressed to meet those goals efficiently.&lt;/li>
&lt;li>Increased internal communication: cross-pollination offers new perspectives with shared knowledge.&lt;/li>
&lt;/ul>
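&lt;p>To make KPIs like these actionable, it helps to agree up front on how each one is computed. The sketch below shows one minimal way to do that in Python; the before/after figures are purely illustrative, not benchmarks from this model.&lt;/p>

```python
# Illustrative only: the before/after figures are examples, not benchmarks.
def percent_change(before, after):
    """Relative change from a baseline value, as a percentage."""
    return (after - before) / before * 100

# Hypothetical measurements for two of the example KPIs above.
infra_spend = percent_change(before=200_000, after=150_000)  # monthly infra cost
features = percent_change(before=20, after=22)               # features shipped per quarter

print(f"Infrastructure spend changed by {infra_spend:.0f}%")  # -25%
print(f"Feature throughput changed by {features:.0f}%")       # 10%
```

Agreeing on a single formula like this before Level 2 avoids later disputes over whether a KPI was actually met.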
&lt;p>In this phase, it’s important that the business outcomes are examined and explained to business stakeholders. It should be a discussion with engineering leadership, the application owner (finance, marketing, etc.), the CEO, and even the board. Without these discussions and alignment, maturing to the next phases will come with little appreciation and possibly even skepticism.&lt;/p>
&lt;h2 id="level-2-operate">Level 2: Operate&lt;/h2>
&lt;p>Cloud native is now established and your technologists are moving to production. While the technical outcome of Level 2 is a fully functional application or group of applications migrated to cloud native tools and practices, the business outcome is the ability to evaluate the benefits of those migrations. This is also the level that most organizations reach and then plateau at, and it is where a cloud native maturity model shows its true value.&lt;/p>
&lt;p>With your established KPIs from Level 1, you will measure success and communicate this to stakeholders.&lt;/p>
&lt;p>In the operation phase, you will be focused on moving to production. You’ll have established standards around technology, your people will be operating it and implementing policy and process. Your business outcome will be around production migration. The business leadership of your organization will want to understand what applications are being moved and why. Be able to clearly communicate the plans to your business leaders.
Repeatable patterns will also emerge as teams operate in Level 2. These will be applied to your business outcomes so that benefits you see in one migrated application can be applied to another without as heavy a lift. These patterns will help streamline operations across your dev, sec and ops teams.&lt;/p>
&lt;p>Your KPIs can also include your return on investment (ROI), but know that in Level 2, your ROI will be lower than when you reach Level 5. This is because you are investing heavily in acquiring tools and establishing the right team and skill set, whereas in Level 5 you are optimizing.&lt;/p>
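&lt;p>The ROI gap between Level 2 and Level 5 is simple arithmetic: the same formula applied while investment is front-loaded versus once spending has shifted to optimization. A minimal sketch, using hypothetical figures:&lt;/p>

```python
# Illustrative ROI arithmetic; all figures are hypothetical.
def roi(benefit, cost):
    """Simple ROI: net benefit relative to cost invested."""
    return (benefit - cost) / cost

# Level 2: heavy spend on tooling, hiring and training; benefits just starting.
roi_level2 = roi(benefit=120_000, cost=100_000)
# Level 5: a similar investment base, with compounded optimization benefits.
roi_level5 = roi(benefit=300_000, cost=100_000)

print(f"Level 2 ROI: {roi_level2:.0%}")  # 20%
print(f"Level 5 ROI: {roi_level5:.0%}")  # 200%
```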
&lt;h2 id="level-3-scale">Level 3: Scale&lt;/h2>
&lt;p>In Level 3, your competency is growing and you are scaling. Up to this point, your teams have been focusing on learning cloud native. In this stage, your business outcomes are dependent on your team’s experience. As the team builds confidence, their competency around security, efficiency and reliability grows and they will implement defined processes for scale. All of these will impact your services and applications as the team improves. Your business should start to notice that operations are more scalable; if not, you will need to improve lines of communication to demonstrate this scale, or review the actual scaling results so they can be optimized further.&lt;/p>
&lt;p>In Level 3, you will have safeguarded your application or service against single points of failure and disappointing performance.&lt;/p>
&lt;p>Monitoring is implemented. This will help the business get reports on what’s working and what isn&amp;rsquo;t working. While the monitoring may be very specific, it will also provide insights into resource utilization to control costs and performance to ensure availability.&lt;/p>
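&lt;p>One common example of monitoring data feeding a business report is turning an availability target into a concrete downtime budget. A small illustrative calculation (the targets below are examples, not recommendations):&lt;/p>

```python
# Turn an availability target into a monthly downtime budget.
# The targets below are examples, not recommendations.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def downtime_budget_minutes(availability_target):
    """Minutes of allowed downtime per month for a given availability target."""
    return MINUTES_PER_MONTH * (1 - availability_target)

for target in (0.999, 0.9999):
    budget = downtime_budget_minutes(target)
    print(f"{target:.2%} availability allows {budget:.1f} minutes of downtime/month")
```

Expressing availability this way lets business stakeholders see at a glance what each extra "nine" of reliability buys.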
&lt;p>Finally, you should be observing the flexibility and scalability of cloud native by comparing old vs. new:&lt;/p>
&lt;ul>
&lt;li>Deploying a server takes minutes with Infrastructure as Code instead of days. &lt;strong>Business translation: faster time to market.&lt;/strong>&lt;/li>
&lt;li>Monitoring for security attacks. &lt;strong>Business translation: less risk of stolen data.&lt;/strong>&lt;/li>
&lt;li>Observability: Logging, metrics and tracing. &lt;strong>Business translation: quicker responsiveness to changes in application behavior or business demand. Better customer experience and reduction in lost sales due to service degradation.&lt;/strong>&lt;/li>
&lt;li>Improved Reusability: containers and microservices make it easier to reuse components already available from previous projects. &lt;strong>Business translation: 1. guarantee of brand image consistency and standardized functionalities throughout the multiple apps; 2. a lower learning curve for customers using those apps.&lt;/strong>&lt;/li>
&lt;/ul>
&lt;h2 id="level-4-improve">Level 4: Improve&lt;/h2>
&lt;p>Level 4 is focused on improvements around security, policy and governance across your environment. The team can focus more of their time on your business instead of maintaining Kubernetes. Level 4 is also the next level at which organizations commonly plateau, and many remain at this level as they further mature.&lt;/p>
&lt;p>Your team has cloud native confidence and now it’s time to take that knowledge and apply it more thoroughly to your business goals. You have continued to measure yourself against the KPIs established in Level 1 and provided those to the business. You’ll have alignment on goals because you can demonstrate outcomes. The business should expect to see:&lt;/p>
&lt;ul>
&lt;li>Established protocols and procedures&lt;/li>
&lt;li>Policy enforcement of compliance standards&lt;/li>
&lt;li>Comparison of cloud native apps vs. non-cloud native&lt;/li>
&lt;/ul>
&lt;p>The business should expect more reporting in this phase. Reporting should cover compliance, security, performance and cost. These should be easily aligned to the business goals established in Level 1.&lt;/p>
&lt;p>At this point, you may start to migrate your other applications and have a better understanding of what you want to achieve and where you will see value during each level of maturity.&lt;/p>
&lt;h2 id="level-5-adapt">Level 5: Adapt&lt;/h2>
&lt;p>This phase of optimization will see lots of changes with people, process, policy and technology. For the business, you should have achieved your business goals and have the measurable results to show your leadership teams, CEO, CFO or the board.&lt;/p>
&lt;p>You will continue to optimize your workloads against further / more advanced cost and performance metrics. You will never stop optimizing your cloud native infrastructure and apps. Here the expected business outcome is the ability to track how optimization continues to move the bar against established goals.&lt;/p>
&lt;p>You may also revisit your goals at this point, adjusting them to what has been achieved and what you want to achieve in future.&lt;/p>
&lt;p>You’ll automate as much as possible according to cloud native best practices to remove human error and so avoid security and performance problems.&lt;/p></description></item><item><title>Aspects: People</title><link>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/people/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/people/</guid><description>
&lt;h1 id="cloud-native-maturity-model---people">Cloud Native Maturity Model - People&lt;/h1>
&lt;h2 id="navigation">Navigation&lt;/h2>
&lt;p>The Cloud Native Maturity Model is composed of six separate documents - the
&lt;a href="./prologue.md" target="_blank">Prologue&lt;/a> and the five key reference documents:&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="./people.md" target="_blank">People&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./process.md" target="_blank">Process&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./policy.md" target="_blank">Policy&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./technology.md" target="_blank">Technology&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./business_outcomes.md" target="_blank">Business Outcomes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="people-introduction">People Introduction&lt;/h2>
&lt;p>As you adopt cloud native technologies, your team&amp;rsquo;s expertise will grow. The Cloud Native Maturity Model spans five key dimensions: People, Process, Policy, Technology and Business Outcomes. This paper focuses on People—the foundation of a successful deployment.&lt;/p>
&lt;p>Progressing from level one to five, your team starts with basic technical knowledge. Over time, they invest in training, build competency, and shift responsibilities to developers. Maturity is reached when DevOps and DevSecOps are fully integrated, and your team confidently explores new technologies. Leadership embraces cloud native as the future, recognizing its business value and driving agile transformation.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Your business aligns with cloud native goals, and leadership understands its benefits. The team is new to the technology but has basic technical knowledge and relevant qualifications.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Individuals actively train and build skills, creating small groups of subject matter experts (SMEs). DevOps emerges as cloud engineers and developer groups contribute platform skills. Leadership begins taking ownership of cloud native initiatives.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Competency expands across Dev, Ops, and Security, with formalized expertise, standardized practices, and accelerators. You may have a platform engineering team (see
&lt;a href="https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/" target="_blank">platform engineering maturity model&lt;/a>) Cloud-native is integrated into business strategy as a core principle.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Competency shifts to development teams, enabling self-service infrastructure. Leadership fully commits, driving cloud native transformation across the organization.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Your organization operates with fully integrated DevOps and Platform Engineering. Teams confidently experiment with new technologies and sandbox trials, continuously innovating and adapting to change.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="organizational-change">Organizational Change&lt;/h2>
&lt;p>With the adoption of cloud native technologies, your business will undertake organizational changes. You must be ready for this, as your team structures, who you hire, and how you organize infrastructure and development will all change.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: In the early stages of cloud native transformation, organizational support is limited, with efforts centered on a proof of concept (POC) or a single application. Agile methodologies are being tested, but team structures remain largely unchanged. Cloud initiatives often start with informal, cross-functional groups, leveraging diverse skill sets for early experimentation and learning. At this stage, collaboration is intense as teams work together to build foundational knowledge. As maturity progresses, collaboration evolves, becoming more structured and role-driven from levels one to five.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Organizational change is underway. You’ll identify structural barriers, such as siloed teams that slow delivery. To improve agility, you’ll establish project teams, adopt agile methodologies, and explore frameworks like Team Topologies and value streams for faster feedback loops. Informal, cross-functional teams won’t scale effectively due to conflicting demands, requiring formal roles, responsibilities, and potential funding discussions. As you move to production, SLAs make this shift even more critical.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: As your team&amp;rsquo;s competency grows, your organizational structure solidifies to support best practices, which are curated and encoded for reuse—often driven by facilitators like a Center of Excellence. Roles and workflows are formalized for consistency, exposing high-overhead services that rely on tickets, calls, and manual requests. This explicit coordination and collaboration between teams and individuals can highlight obvious services that would benefit from being “engineered as a service” and will evolve into platform engineering teams. The collaboration itself yields valuable information that you can use in combination with the other data and metrics you are collecting.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: You&amp;rsquo;re transitioning to platforms and value streams, with services efficiently delegated to product teams. Developers can focus on product code aligned with business goals, seamlessly consuming necessary services through efficient interfaces—eliminating time spent on platform and infrastructure tasks outside their core responsibilities.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: At full maturity, the organization is fully committed to cloud native, seamlessly integrating culture, technology, and processes for maximum efficiency. It rapidly adopts and discards elements based on their value, ensuring continuous optimization.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="teams-and-decentralization">Teams and Decentralization&lt;/h2>
&lt;p>Just like workloads in the cloud, your teams will become fit-to-purpose and highly scalable.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Teams are actively exploring cloud native tooling, primarily Kubernetes, with a clear goal of reaching production—not just experimentation. Work is typically conducted within an MVP program by a cross-functional team of developers, system engineers, and other experts with diverse technical experience. Success requires a strong commitment to developing skills across all layers of the stack, supported by frequent collaboration to maintain momentum. Easy access to specialists across operations helps ease difficult requests; for example, a network engineer can vet and justify requests that at first glance seem inappropriate. Leadership provides sponsorship and accountability to drive progress.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Central services and responsibilities are being formalized, with a focus on consolidating tooling across technical domains like cloud engineering and middleware release. Developers are increasingly working directly with these service providers. Early scaling efforts emerge, often through informal methods like wiki documentation or repo cloning to replicate MVP success across other applications. Growth is happening, but without the necessary organizational reforms to support it. This drives the organization to reassess its structure, sparking active advocacy for change. As momentum builds, new cloud-focused roles may be created, but they often remain siloed, limiting their efficiency and their understanding of dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Service consolidation is accelerating, with development teams assuming clear responsibilities and consuming well-defined interfaces, ranging from email to fully developed APIs. However, the cognitive load of integrating various services is becoming a distraction, overwhelming teams that must independently manage consumption. To address this, organizations begin exploring a
&lt;a href="https://tag-app-delivery.cncf.io/whitepapers/platforms/" target="_blank">&lt;strong>Platform Engineering&lt;/strong>&lt;/a> strategy, treating the platform as a product with dedicated teams providing end-to-end support. Success requires the right skills, authority, and a team explicitly focused on platform enablement. Additionally, restructuring around technology value streams becomes essential, as siloed structures increasingly hinder efficiency. This shift requires a strong leadership mandate to drive change.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: At Level 4, the organization focuses on structuring its platform to centrally expose and integrate services for seamless developer consumption. The platform team consolidates previously fragmented capabilities into standardized, self-service interfaces accessible via portals, APIs, and CLIs. Developers move beyond email-based or Slack requests, instead consuming APIs and curated resources, shifting away from a consulting-based support model. This evolution embeds organizational policies and processes directly into the platform. Value Stream teams can now leverage platform capabilities with minimal handovers, enabling developers to request resources and services programmatically while staying focused on business logic.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Teams have seamless access to the tools and capabilities they need to deliver business value without friction. Developers focus solely on business logic, free from platform and infrastructure concerns, while platform engineers ensure on-demand infrastructure availability. At this stage, the organization operates at peak efficiency—services are consumed without facilitation, teams remain lean, and both the technical ecosystem and organizational structure are right-sized. Tech debt is continuously managed within each domain, preventing accumulation. Efficiency drives every decision, ensuring the right resources are in place without compromising business needs (i.e. not downsizing to the detriment of the business).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="upskilling">Upskilling&lt;/h2>
&lt;p>We know you already have great people, but they need to be equipped to take on the new challenges inherent to the cloud.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Your team must grasp the challenges cloud native solves, the fundamentals of containers and how they differ from virtual machines, and why cloud native is ideal for microservice architectures. Leverage available resources—community forums, internal experts, and external specialists—to accelerate learning. A strong understanding of Kubernetes&amp;rsquo; declarative model and core resources will help newcomers navigate the ecosystem effectively.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Your team has grown comfortable with YAML and core cloud native tools, developing the skills needed to run production applications. Through hands-on experience, they’ve learned workload management, secure containerization, and RBAC, successfully deploying Kubernetes on-premises or via a managed service. Developers and platform engineers are gaining deeper insight into cloud native challenges, not just through their own work but also by implementing third-party security tooling. The team is shifting its mindset to focus on non-functional requirements: prioritizing scalability, automation, and service interfaces over managing infrastructure. As they rethink traditional architectures, they are building a stronger foundation for long-term cloud native success.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: At this stage, teams are advancing beyond basic cloud native adoption, extending cloud platforms and leveraging the broader CNCF ecosystem. Senior and principal engineers play a crucial role in scaling these efforts, bringing deep expertise in Kubernetes, CI/CD, policy management, and public cloud operations. Cloud-native knowledge is widespread, with skills distributed in a bell curve—many at an intermediate level, while others push boundaries by experimenting, refining, and proving competency through CKA/CKAD certifications, bake-offs, and hands-on implementation. The organization confidently integrates new CNCF projects, such as extending Kubernetes with Crossplane or building developer portals with Backstage, solidifying its ability to navigate the messy middle of cloud native transformation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Cloud-native adoption is now deeply ingrained across teams, fostering a culture of shared knowledge and continuous learning. Developers, platform engineers, and architects are actively refining processes, selecting tools that align with organizational policies, and proactively delivering business value. Senior engineers and architects play a crucial role in guiding teams as they expand Kubernetes and integrate cloud native technologies to meet specific needs.&lt;/p>
&lt;p>At this stage, developers evolve beyond writing code—they become problem-solvers, whether building applications, optimizing platforms, or extending infrastructure. A new generation of senior engineers emerges, comfortable with cloud native complexities and capable of developing operators, custom resource management, and automation frameworks. The shift from reactive problem-solving to strategic, proactive decision-making empowers teams to innovate with confidence.&lt;/p>
&lt;p>As responsibilities grow, technical leadership is essential at every level. Managers and senior decision-makers must develop technical fluency to evaluate and approve solutions effectively. With layers of abstraction and self-service platforms, teams can work efficiently within well-defined guardrails, ensuring both autonomy and alignment with business goals. This has the effect also of allowing junior staff to become much more productive, much more quickly as they are able to direct their attention to functional development and engineering without having to be caught up in environment setup and other concerns not directly concerned with their main responsibilities.&lt;/p>
&lt;p>A further mark of maturity at Level 4 is the quality of documentation. At this level it covers all aspects of platforms as well as infrastructure, including design as well as operational documentation. This is maintained alongside technical artefacts such as code and kept up to date, with appropriate archiving as features are removed and technical debt reduced.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Your organization is now composed of highly capable individuals and teams who fully leverage cloud native capabilities, engaging effectively across technical, architectural, and business domains. Teams can swiftly identify opportunities to drive business value and reduce technical debt, fostering a culture of continuous learning and innovation.&lt;/p>
&lt;p>As a learning organization, you stay ahead of trends, integrating both mature CNCF projects and emerging sandbox technologies. Skilled engineers contribute to open-source projects, creating opportunities for outreach and deeper industry engagement.&lt;/p>
&lt;p>Teams are small but diverse, equipped with broad expertise to mitigate key-person risk and ensure stability. Strong communication protocols and a robust professional development pipeline sustain long-term growth, enabling teams to adapt, collaborate, and innovate effectively.&lt;/p>
&lt;/li>
&lt;/ul>
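&lt;p>The declarative model mentioned under Level 1 can be illustrated with a toy reconciliation loop: you state the desired state, and a controller repeatedly drives actual state toward it. This is a simplified sketch in Python, not Kubernetes itself; the workload names and counts are made up.&lt;/p>

```python
# Toy sketch of the declarative model: state desired replica counts,
# then let a reconcile loop drive actual state toward them.
# Workload names and counts are illustrative.
desired = {"web": 3, "worker": 2}
actual = {"web": 1, "worker": 4}

def reconcile(desired, actual):
    """One reconciliation pass: converge actual state on the declared spec."""
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have != want:
            # A real controller would create or delete replicas here.
            actual[name] = want
    return actual

print(reconcile(desired, actual))  # {'web': 3, 'worker': 2}
```

Grasping this pattern early makes the rest of the ecosystem, where almost everything is a declared spec plus a controller, far easier to navigate.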
&lt;h2 id="security">Security&lt;/h2>
&lt;p>With the many attack vectors that must be mitigated in a cloud native context, it is imperative to have a security-first posture for all aspects of the application lifecycle (including software delivery and ongoing operation/maintenance).&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Security is heavily reliant on individuals manually configuring, learning, and adhering to best practices across development, source control, integration, testing, troubleshooting, and cloud resources. Clear guidelines define preferred, safe, and forbidden practices, along with documented workarounds that include measures to minimize or reverse their impact. Some security responsibilities are delegated within the cloud native team, requiring close coordination with core security teams. This may involve strict skill requirements or a dedicated security expert to ensure alignment and compliance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: With production workloads, your team must be fully equipped to uphold security best practices. Building on Level 1, team members now have clear, well-maintained guidelines, eliminating reliance on workarounds. Accountabilities are defined, ensuring security is a shared responsibility across product and security teams. Your organization invests in cloud native security training and certifications, empowering key team members to implement and enforce security policies effectively. Engineers and operators take ownership of mutual TLS, automated secret management, artifact provenance, and vulnerability scanning, integrating security seamlessly into daily workflows and strengthening the organization’s overall security posture.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: As teams mature, manual security practices are replaced with third-party tooling for tasks like secret management and mutual TLS. Supply chain security becomes a priority, with a focus on artifact provenance, bill of materials, and vulnerability scanning. Clear accountability and operational responsibilities are essential, as security remains a shared commitment in cloud native environments. Specialists in infrastructure, platform, and application security ensure depth in key technical domains. Identity and Access Management and physical access controls are established and rigorously maintained, reinforcing a culture of security at every level.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Your team is embracing Zero Trust, shifting from default allow to default deny, ensuring access is granted case by case. This level of control is possible through integrated platform services authenticated by a single identity provider. Developers, engineers, and end users—from finance and HR to customers—experience seamless, secure access to approved capabilities, while the organization gains greater visibility and control over security and compliance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Your team operates at peak efficiency, with the right people, tools, and policies ensuring workloads and teams can handle requests safely and seamlessly. You rely on experts who can articulate what must be done, how to do it, and what’s possible. Clear, written guidelines support well-defined procedures, carried out by skilled engineers. Threat detection is embedded, whether preventative or retrospective, while security roles focus on policy development or technical execution. The organization actively contributes to public policy and open-source tooling, reinforcing its leadership in security and best practices.&lt;/p>
&lt;/li>
&lt;/ul>
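&lt;p>The Zero Trust shift described at Level 4, from default allow to default deny, can be sketched in a few lines: access is refused unless an explicit allow rule matches. The roles and resources below are hypothetical examples, not a real policy engine.&lt;/p>

```python
# Default deny: nothing is permitted unless an explicit allow rule matches.
# Roles and resources are hypothetical examples.
ALLOW_RULES = {
    ("developer", "staging-cluster"),
    ("platform-engineer", "prod-cluster"),
}

def is_allowed(role, resource):
    """Zero Trust stance: deny by default, allow only explicit pairs."""
    return (role, resource) in ALLOW_RULES

print(is_allowed("developer", "staging-cluster"))  # True
print(is_allowed("developer", "prod-cluster"))     # False
```

The key design point is the absence of any fallback: an unmatched request is denied, rather than an unmatched request being allowed.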
&lt;h2 id="culture">Culture&lt;/h2>
&lt;p>The adoption of cloud native technologies requires a significant cultural shift. A team’s willingness to embrace new ideas, experiment, and collaborate is a key indicator of its ability to progress through the maturity levels.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Initially, the cultural shift is limited to a small, dedicated team focused on a proof-of-concept (POC) or a minimum viable product (MVP). This team, often a combination of senior specialists and innovators, applies existing knowledge to new cloud native challenges. The broader organization’s culture remains largely unchanged at this stage, with cloud native efforts viewed as an isolated experiment.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: As you move toward production, your team&amp;rsquo;s interest may exceed its capabilities, requiring investment in training. The initial POC team may feel the burden of supporting new projects, highlighting a need to upskill operations staff to better support developers. The organization&amp;rsquo;s overall culture still favors traditional applications and platforms, making cloud native development feel like an &amp;ldquo;island.&amp;rdquo; A low level of trust may exist, with some viewing cloud native as a threat, creating competition and insecurity.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Cloud platforms are now widely accepted by product owners and business leaders. It&amp;rsquo;s critical to demonstrate the benefits of cloud native to cultivate trust and overcome any lingering resistance. You know you&amp;rsquo;re approaching maturity when technical teams, management, and leadership are all comfortable with cloud native. While competition may still exist (e.g., implementing competing tools like Argo vs. Flux), a high level of trust is a vital commodity for continued adoption.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Your organization has crossed the cultural chasm. Cloud native is the predominant mode of operation, with traditional application patterns becoming the exception, approved only under specific business requirements. Cost-consciousness is high, and the organization focuses on balancing innovation with efficiency, aiming to be &amp;ldquo;lean but not anemic.&amp;rdquo; Trust becomes a major theme, and the cultural shift is fully embraced.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: At this level, a culture of ease, trust, and respect prevails. Burnout is no longer a dominant topic. Conflicts and disagreements are addressed quickly and effectively, with a shared confidence that everyone is working toward the same goals. This culture is a key driver of efficiency, enabling continuous improvement and innovation without friction.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="ai">AI&lt;/h2>
&lt;p>As well as its profound effect on technology, AI introduces new roles and responsibilities and requires skills and specialties not typically found in many organizations.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: At this first level, you are not yet in production, and individuals and teams primarily operate within their traditional disciplines, with limited understanding of, or direct integration with, the complexities of cloud native AI systems. AI may first be used by many as a finished product or service, with little understanding of its underlying mechanisms or how it is built and deployed.&lt;/p>
&lt;p>Data scientists and ML engineers who are familiar with data and machine learning develop scripts locally, then may have them reengineered by Distributed Systems Engineers for execution at scale, perhaps on non-cloud native platforms, perpetuating a clear division of labor and expertise. At Level 1, as you take your first steps into cloud native AI, the need for close collaboration between AI and cloud native platform engineering teams emerges, as specific engineering is required to cater for complex configurations and tasks such as GPU virtualization and dynamic allocation. AI practitioners find they need to engage in activities outside their core ML expertise, such as becoming familiar with Kubernetes and containers and, from their perspective, other platform or infrastructure concerns. The first AI/ML pipelines in Kubernetes may include multiple technologies that are not tightly integrated.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: As we enter production, efforts are made to simplify the interaction between AI practitioners and cloud native platforms through abstraction layers and improved tooling. AI practitioners start to look towards user-friendly, well-known SDKs that abstract away Kubernetes details. Debugging may remain challenging, requiring involvement from both AI/MLOps teams and platform engineering.&lt;br>
New roles or responsibilities may need to be created to help with governance and compliance. These may include model validation, as well as ensuring activities comply with rules such as the European Union’s Artificial Intelligence Act.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: As we scale out AI within the organization, we may start to see the emergence of the MLDevOps or AI Engineer as the glue between data science, platform engineering, infrastructure, and development. We may also see the emergence of trust and safety experts to manage content and conduct, mitigate abuse and protect user rights and brand safety. Operators will likely start to develop skills using AI-powered tooling for Kubernetes management such as K8sGPT for the natural language processing of logs, or using machine learning to analyze massive datasets for security threat identification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: The “AI Engineer” role evolves to focus heavily on AI tooling, infrastructure, and deploying AI chains and agents, indicating a deep specialization in AI-driven operations. All personnel benefit from highly automated and intelligent systems. Administrators and site reliability engineers benefit from natural language interfaces for cluster control, which significantly lowers their learning curve for managing complex Kubernetes clusters. Skills in continuous optimization for multiple criteria (e.g., power conservation, latency) become crucial, often facilitated by AI-driven models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: While we attempt to be all-encompassing in this model, the Cartografos Working Group would like to acknowledge that Level 5 AI maturity is outside our remit. AI is changing rapidly, and what may be Level 5 today might not be tomorrow. We are excited to see how AI will impact people’s skills.&lt;/p>
&lt;/li>
&lt;/ul>
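&lt;p>The Level 3 idea of using machine learning to analyze datasets for security threat identification can be sketched in miniature. The sample below is a hypothetical illustration, not production tooling: it flags minutes whose error counts spike well above the baseline using a simple z-score test.&lt;/p>

```python
# Illustrative sketch: flag anomalous spikes in per-minute error counts,
# a toy version of the log-analysis skills described at Level 3.
# The data and threshold below are hypothetical.
from statistics import mean, stdev

def anomalous_minutes(counts, threshold=3.0):
    """Return indices whose error count sits more than `threshold`
    standard deviations above the mean (a simple z-score test)."""
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > threshold]

# Mostly steady traffic with one obvious spike at index 5.
error_counts = [12, 9, 11, 10, 13, 250, 11, 12, 10, 9, 11, 10]
print(anomalous_minutes(error_counts))  # → [5]
```

&lt;p>Real deployments would of course use richer models over far larger datasets; the point is only that the skill set starts from simple statistical baselining.&lt;/p>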
&lt;h2 id="cncf-certifications">CNCF Certifications&lt;/h2>
&lt;p>The Cloud Native Computing Foundation (CNCF) provides a vendor-neutral home for leading open-source projects like Kubernetes, Prometheus, and Envoy.&lt;/p>
&lt;p>To build a sustainable cloud native ecosystem, investing in CNCF certifications is essential. Organizations should consider CKA and CKAD at Levels 2 and 3, with CKS becoming relevant at Level 4 to deepen security expertise.&lt;/p>
&lt;p>
&lt;a href="https://training.linuxfoundation.org/certification/certified-kubernetes-administrator-cka/" target="_blank">&lt;strong>Certified Kubernetes Administrator (CKA)&lt;/strong>&lt;/a>&lt;/p>
&lt;ul>
&lt;li>This program provides assurance that CKAs have the skills, knowledge, and competency to perform the responsibilities of Kubernetes administrators.&lt;/li>
&lt;/ul>
&lt;p>
&lt;a href="https://training.linuxfoundation.org/certification/certified-kubernetes-application-developer-ckad/" target="_blank">&lt;strong>Certified Kubernetes Application Developer (CKAD)&lt;/strong>&lt;/a>&lt;/p>
&lt;ul>
&lt;li>This exam certifies that users can design, build, configure, and expose cloud native applications for Kubernetes.&lt;/li>
&lt;/ul>
&lt;p>
&lt;a href="https://training.linuxfoundation.org/certification/certified-kubernetes-security-specialist/" target="_blank">&lt;strong>Certified Kubernetes Security Specialist (CKS)&lt;/strong>&lt;/a>&lt;/p>
&lt;ul>
&lt;li>This program provides assurance that a CKS has the skills, knowledge, and competence on a broad range of best practices for securing container-based applications and Kubernetes platforms during build, deployment and runtime. CKA certification is required to sit for this exam.&lt;/li>
&lt;/ul>
&lt;p>
&lt;a href="https://www.cncf.io/training/certification/kcna/" target="_blank">Kubernetes and Cloud Native Associate (KCNA)&lt;/a>&lt;/p>
&lt;ul>
&lt;li>KCNA is a pre-professional certification designed for candidates interested in advancing to the professional level through a demonstrated understanding of Kubernetes foundational knowledge and skills.&lt;/li>
&lt;/ul>
&lt;p>
&lt;a href="https://www.cncf.io/training/certification/kcsa/" target="_blank">Kubernetes and Cloud Security Associate (KCSA)&lt;/a>&lt;/p>
&lt;ul>
&lt;li>This certification is ideal for individuals interested in learning about or working with cloud native security technologies.&lt;/li>
&lt;/ul>
&lt;p>
&lt;a href="https://www.cncf.io/training/kubestronaut/" target="_blank">Kubestronaut Program&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Individuals who have successfully passed all of the CNCF’s Kubernetes certifications – CKA, CKAD, CKS, KCNA, and KCSA – earn the Kubestronaut title.&lt;/li>
&lt;/ul></description></item><item><title>Aspects: Policy</title><link>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/policy/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/policy/</guid><description>
&lt;h1 id="cloud-native-maturity-model---policy">Cloud Native Maturity Model - Policy&lt;/h1>
&lt;h2 id="navigation">Navigation&lt;/h2>
&lt;p>The Cloud Native Maturity Model is composed of six separate documents - the
&lt;a href="./prologue.md" target="_blank">Prologue&lt;/a> and the five key reference documents:&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="./people.md" target="_blank">People&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./process.md" target="_blank">Process&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./policy.md" target="_blank">Policy&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./technology.md" target="_blank">Technology&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./business_outcomes.md" target="_blank">Business Outcomes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The Cloud Native Maturity Model spans five key dimensions: People, Process, Policy, Technology and Business Outcomes. This paper addresses policy: the critical implementation of internal requirements and compliance with external regulations.&lt;/p>
&lt;p>The journey of adopting policy in cloud native environments parallels the learning curve of Kubernetes and cloud native technologies, transitioning from an imperative to a declarative approach. Rather than specifying the procedure, this dimension outlines the outcomes and results you should anticipate. Understand that policy adoption is a gradient; every organization has a different risk appetite.&lt;/p>
&lt;p>Policy comes from both internal and external sources, forming a set of rules and requirements that the technical organization must interpret and comply with. In fast-changing environments, policy can be a contentious topic, often leading to disputes. Recognizing these challenges—and understanding how to mitigate them—is essential.&lt;/p>
&lt;p>Policy input generally falls into three categories, each targeting different layers of the organization:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Regulators (Business)&lt;/strong> – Ensure the integrity of the industry, such as finance, healthcare, or utilities.&lt;/li>
&lt;li>&lt;strong>Legal (Compliance)&lt;/strong> – Mandates like GDPR and EU DORA, enforced by regulators.&lt;/li>
&lt;li>&lt;strong>Technical (Standards &amp;amp; Guidelines)&lt;/strong> – Frameworks from institutions like NIST and MITRE that provide technical guidance.&lt;/li>
&lt;/ol>
&lt;p>Not all policies carry the same weight, making it important to distinguish between regulatory, legal, and technical requirements.&lt;/p>
&lt;h2 id="policy-level-overview">Policy Level Overview&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Level 1: You will have a limited set of documented policies in place to support services you&amp;rsquo;re building in the cloud. You are working with compliance teams to evaluate policy obligations and how they may change with cloud native technology.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: As your services reach production, you have agreed on initial policies and recorded them.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: You will implement policy-as-code and build this into your processes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: You now have defined SLAs around policies and remediation, treating policy-as-code and enforcement as subject to a software development lifecycle.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Based on your learnings, you will refine your policies as your organization achieves maturity, taking advantage of technologies such as machine learning in order to improve detection and enforcement. Your policy apparatus is constantly evolving with the changing cloud native threat landscape.&lt;/p>
&lt;/li>
&lt;/ul>
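&lt;p>To make the Level 3 notion of policy-as-code concrete, the sketch below validates a workload manifest in plain code. The rules and manifest fields are hypothetical; a real implementation would typically use a dedicated engine such as OPA or Kyverno, but the shape of the idea is the same.&lt;/p>

```python
# Illustrative policy-as-code sketch (hypothetical rules): require an
# owning-team label and per-container resource limits before admission.
def check_policy(manifest):
    violations = []
    labels = manifest.get("metadata", {}).get("labels", {})
    if "team" not in labels:
        violations.append("metadata.labels.team is required")
    containers = (manifest.get("spec", {})
                          .get("template", {})
                          .get("spec", {})
                          .get("containers", []))
    for c in containers:
        if "limits" not in c.get("resources", {}):
            violations.append(f"container {c.get('name', '?')} has no resource limits")
    return violations

manifest = {
    "metadata": {"labels": {"team": "payments"}},
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {"limits": {"cpu": "500m"}}},
        {"name": "sidecar", "resources": {}},
    ]}}},
}
print(check_policy(manifest))  # → ['container sidecar has no resource limits']
```

&lt;p>Because the policy is ordinary code returning explicit violations, it can run in CI, in an admission webhook, or as a reporting job without modification.&lt;/p>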
&lt;h2 id="policy-creation">Policy Creation&lt;/h2>
&lt;p>You will need to translate your organization’s policies and compliance requirements to your cloud native environment.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: When adopting cloud native environments, start at Level 1 by understanding your application&amp;rsquo;s functional and architectural requirements. Map these to both internal policies—such as infosec, physical security, and business policies—and external regulations from authorities and industry bodies.&lt;br>
You’ll need to reconcile existing policies with new cloud native guidelines and implications. Open communication between cloud native and policy teams is essential, ensuring stakeholders understand key policy requirements and can determine where traditional policies apply—or don’t. Intent matters as much as implementation.&lt;br>
To maintain velocity, identify policies that may create friction and discuss how traditional approaches, like universal backups or hot standbys, may not fit cloud native environments. Where necessary, engage policy stakeholders to refine or update policies, secure exemptions for early-stage or non-production use, and implement minimum production requirements (e.g., secrets management) at Level 1 to ease the transition to the next stage.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Address the most critical issues first, such as audit requirements and major security threats. Treat this process as paying down policy exemption technical debt—whether by creating new policies, adhering to existing ones, or reassessing past decisions. At this stage, tactical measures like manual procedures may be necessary to meet policy requirements, but these should be recognized as temporary workarounds that contribute to technical debt.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: The cloud team should actively influence policy decisions by collaborating with business leaders and key stakeholders (rather than a one-way dialogue of exception requests). The cloud team may also serve as subject matter experts within or alongside the compliance team. Given cloud native’s reliance on automation, identify manual processes from Level 2 that need to be addressed. Conduct thorough threat modeling and prioritize the highest-risk areas of the organization’s workflow. For example, if developers are introducing out-of-policy code—such as code without an SBOM—understand the reasoning and work to address the underlying issue.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: At this stage of maturity, you will customize policies to align with your organization&amp;rsquo;s business needs while minimizing exceptions. This will integrate business functionality and logic into your technical policy infrastructure. Over time, your organization will gain a clearer understanding of its highest risks and greatest value areas, leading to a stronger emphasis on effective classification. Earlier levels may rely on external guidance without fully aligning with internal skill set requirements, but by Level 4, this matures. There will be a well-developed ability to classify risks and a more strategic engagement with technical, operational, and business risks. Policy decisions should be actively shaped by real-world learnings. Technical edge cases arising from enforcement should inform policy evolution. Additionally, at Level 4, organizations may begin leveraging LLMs to accelerate both technical and written policy creation—while ensuring accountability remains intact.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: At this stage, while still drawing from frameworks like CIS and NIST, your organization should also focus on giving back by actively contributing policies to the open source community. Regulators will likely seek input from Level 5 organizations to shape policy, making your expertise highly valuable in these discussions. Your deep understanding of what works—and what doesn’t—positions you as a key partner in policy formulation.&lt;/p>
&lt;p>Leverage multiple data sources, including user activity, log data, internal datasets, and governmental guidelines, to drive informed policy decisions. With your advanced capabilities, you can share insights through case studies, thought leadership, and open source contributions, helping to shape the broader industry landscape.&lt;/p>
&lt;/li>
&lt;/ul>
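&lt;p>The Level 3 example of out-of-policy code shipped without an SBOM can be surfaced rather than silently blocked, giving the team something concrete to discuss. The following is a hypothetical sketch of such a release gate; the artifact and field names are illustrative.&lt;/p>

```python
# Hypothetical release gate: report build artifacts that carry no SBOM
# attachment, so the underlying cause can be investigated.
def missing_sbom(artifacts):
    """Return names of artifacts shipped without an SBOM."""
    return [a["name"] for a in artifacts if not a.get("sbom")]

artifacts = [
    {"name": "api:1.4.2", "sbom": "api-1.4.2.spdx.json"},
    {"name": "worker:0.9.0", "sbom": None},
]
print(missing_sbom(artifacts))  # → ['worker:0.9.0']
```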
&lt;h2 id="implementation-and-compliance">Implementation and Compliance&lt;/h2>
&lt;p>You will need policies in place to implement compliance, especially in highly regulated industries. For compliance, there is a gradient in what you will achieve.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Take the time to understand your compliance requirements, such as CIS, NIST, and PCI. Define SLOs and set compliance priorities early on. While this may not be a strict pre-production requirement, its importance will grow as you move toward production. Recognize that existing policies may not seamlessly translate to a cloud native environment. Be prepared to address compliance concerns with different technical implementations—for example, using network policies to isolate cloud native platforms that might otherwise violate compliance or monitoring for anomalous behavior.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Now that you&amp;rsquo;re in production, ensure your primary compliance obligations are enforced. This may start with initial auditing, conducted manually or through simple scripts. Establish effective log collection across your infrastructure, platform, and application stack to enable meaningful analysis. Begin paying down the technical debt from Level 1 by implementing policies where exceptions or exemptions were previously granted.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Policy compliance and auditing are increasingly automated within Kubernetes, often incorporating policy-as-code. Organizations may evaluate both generalized and specialized cloud native policy enforcement tools, leading to a proliferation of tooling. It&amp;rsquo;s important to identify opportunities for consolidation.&lt;/p>
&lt;p>Manual procedures from Level 2 should now be encoded as services, leveraging policy-as-code tools and CNCF ecosystem solutions to enforce policies effectively. It can be tempting to set a high bar of entry for tooling (e.g. to impose restrictions on tool adoption, or to enforce standards too early); however organizations should remain open to evaluating new projects as the cloud native landscape evolves.&lt;/p>
&lt;p>This phase also presents an opportunity to strengthen the software supply chain by using emerging software patterns and tools to enhance policy enforcement. Just as a programming language becomes self-hosting when it can compile itself, organizations should aim for a policy framework that can enforce and refine itself through automation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Technical and business decision-making actively drives the consolidation and standardization of technologies, moving away from arbitrary or ad-hoc choices. Policy tooling expands beyond infrastructure to include applications such as traffic proxies, service meshes, message buses, and Linux, increasing the scope of managed policies while maintaining control through declarative configurations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Key activities include applying policies to business applications or extending policy enforcement to middleware. Every enforcement gate should be backed by a business decision. Integrate policy into the software delivery lifecycle—unit test, smoke test, and integration test your policies. Treat policies as code and include them in your SDLC to ensure consistency, reliability, and compliance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Compliance never ends! Strengthen your feedback loop with stakeholders and leverage advanced machine learning, AI, and other tools to establish baselines, detect anomalies, and ensure visibility across large volumes of compliance data. At this stage, you may contribute to policy development by participating in organizations like NIST, the Linux Foundation, or industry-specific policy groups, providing valuable insights to the broader community.&lt;/p>
&lt;p>Your technical implementations, security strategies, and defense posture can now be shared more openly than ever before. By contributing your expertise, you help the cloud native community navigate an increasingly complex and volatile landscape.&lt;/p>
&lt;/li>
&lt;/ul>
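&lt;p>Treating policies as code means they receive ordinary tests in the same SDLC as any other software, as described at Level 4. A minimal sketch, assuming a hypothetical rule that denies privileged containers:&lt;/p>

```python
# Hypothetical admission rule: deny containers requesting privileged mode.
def allows(container):
    return not container.get("securityContext", {}).get("privileged", False)

# Ordinary unit tests, runnable in CI like any other code:
assert not allows({"securityContext": {"privileged": True}})
assert allows({"name": "api"})
print("policy tests passed")
```

&lt;p>Because the rule and its tests live together in version control, a change to the policy goes through the same review, test, and release gates as an application change.&lt;/p>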
&lt;h2 id="ai">AI&lt;/h2>
&lt;p>Policy is a vital topic within AI, both internally within the organization and externally in terms of legislation and regulation. Beyond the technical, there are also ethical considerations with AI.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: This is where foundational consideration is given to AI. Awareness of data privacy regulations such as GDPR, as well as legislation such as the European Union’s AI Act, is critical. Internal model registries should be implemented where required, technical protections put in place, and data strategies for model training developed. Data classifications and existing data separation techniques may not fit neatly, as models in development may need access to production data for training purposes. There should also be an acknowledgement of the need to address potential bias in data.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: This level involves implementing policies and practices to address immediate production needs and standardizing initial approaches to security and data handling. General best practices for security should be followed, including penetration testing and compliance checks relevant to the workload’s industry, such as finance or healthcare. Ensure there is clearly defined data ownership and data lineage throughout the AI lifecycle. AI models are only as good as the data they are trained on, so it is important to actively monitor and address potential biases in the data and algorithms.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: As you scale, it is important to adopt increasingly rigorous supply chain best practices such as image validation, attestation, signing, and data provenance. Continuously validate models for potential biases and to ensure ethical outcomes. This includes a “safety by design” approach that embeds user safety and rights into the design and development of products.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: In line with the AI discussion in the technology section, we see a convergence where AI itself is used to enhance governance, security, and compliance within cloud native systems, creating a symbiotic relationship. This means, for example, using MLOps pipelines to capture and maintain data provenance. AI may also be at the core of your Trust and Safety tooling: managing content and conduct, scanning for risks, mitigating abuse, and protecting brand safety.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: This stage is characterized by the full automation and optimization of policy adherence and the integration of AI into all operational aspects of cloud native environments, leading to peak efficiency and sustainability. Internally, within the cloud native environment, Kubernetes controllers integrate with a natural language interface using LLMs, understanding logs, managing resource usage, and handling many other tasks. Full data provenance and lineage is automatically captured and maintained, ensuring transparency and accountability. In terms of sustainability, there is continuous optimization of resource scheduling - not just for performance and cost, but also for power conservation, ensuring maximum energy efficiency and a reduced carbon footprint. Finally, you see a future state where policies for responsible AI and ethical use are not merely applied but autonomously monitored, adapted, and enforced by the AI systems themselves, in alignment with the “Safety by Design” philosophy at an architectural level.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Aspects: Process</title><link>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/process/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/process/</guid><description>
&lt;h1 id="cloud-native-maturity-model---process">Cloud Native Maturity Model - Process&lt;/h1>
&lt;h2 id="navigation">Navigation&lt;/h2>
&lt;p>The Cloud Native Maturity Model is composed of six separate documents - the
&lt;a href="./prologue.md" target="_blank">Prologue&lt;/a> and the five key reference documents:&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="./people.md" target="_blank">People&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./process.md" target="_blank">Process&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./policy.md" target="_blank">Policy&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./technology.md" target="_blank">Technology&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./business_outcomes.md" target="_blank">Business Outcomes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The Cloud Native Maturity Model spans five key dimensions: People, Process, Policy, Technology and Business Outcomes. This paper addresses process. Your process will affect your cloud native footprint, cluster topology, and sizing. It will support your CI/CD maturity and help to improve your team&amp;rsquo;s ability to ship applications faster.&lt;/p>
&lt;p>As you start your transformation, you will lack process and consistency across implementations. However, that will change over time as you develop and document your processes and capability. Documentation should be close to code and potentially machine based. By the time you are achieving maturity, you will have a consistent and mature process with a template-based approach, and you will actively revisit your standards, identify configuration drift and adjust standards based on business requirements.&lt;/p>
&lt;p>Process enables repeatability. In your ephemeral environment, you need to ensure that the processes implemented can be repeated across teams and clusters. This is core to your process implementation.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: You will map both functional (application features and code) and non-functional (performance, capacity, availability) application requirements while defining how your organization will scale. Feedback will be handled manually through Slack, email, or phone, and remediation will also be manual.&lt;/p>
&lt;p>To introduce repeatability, you will begin defining your Git workflow. Keeping platforms and technology up to date, especially with security patches, is critical, as vulnerabilities pose significant risks. Updates will likely be applied manually on an ad-hoc basis or through built-in distribution update systems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: The focus shifts to promoting basic applications into production. By this stage, you should have a well-established Git and CI workflow. You will implement structured build and deployment processes that align with cloud-native and container-native CI/CD principles.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Standardization across the organization becomes a priority, streamlining onboarding and expanding cloud-native adoption. You will establish a feedback loop and invest in repeatability.&lt;/p>
&lt;p>Key considerations include:&lt;/p>
&lt;ul>
&lt;li>Ensuring accessibility to necessary tools (e.g., Git services, workspace collaboration) to save time and reduce duplication.&lt;/li>
&lt;li>Implementing a process for measuring resource usage, including container utilization, CPU, and memory (runtime and uptime).&lt;/li>
&lt;li>Expanding automation to software release processes and platforms.&lt;/li>
&lt;li>Enhancing lifecycle operations, such as patching and upgrades, particularly for CVEs and critical updates, by incorporating Infrastructure-as-Code (IaC) tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Level 4: Governance will fully support DevSecOps, with guardrails in place to facilitate agile software development. You will establish an application services library and set policies around container usage, such as auto-scaling or high-performance computing (HPC) policies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Achieving process maturity will see you build design capabilities for cloud native. You’ll also automate responses, using monitoring to detect failures and restart or otherwise manage problematic resources. Resource usage data will play a key role in cost optimization, and your processes will include providing business cost analysis, ensuring efficiency and financial accountability in cloud-native operations.&lt;/p>
&lt;/li>
&lt;/ul>
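&lt;p>The Level 3 step of measuring resource usage can start very simply. The sketch below (hypothetical sample data) reduces raw per-container CPU samples to an average utilization against the configured limit, the kind of figure that feeds the Level 5 cost analysis.&lt;/p>

```python
# Illustrative measurement sketch: average sampled usage as a fraction
# of the configured limit. Sample values are hypothetical.
def utilization(samples, limit):
    """Average usage as a fraction of the configured limit."""
    return sum(samples) / len(samples) / limit

cpu_millicores = [120, 180, 150, 210]  # sampled container CPU usage
print(f"cpu: {utilization(cpu_millicores, limit=500):.0%}")  # → cpu: 33%
```

&lt;p>In practice these samples would come from the platform’s metrics pipeline; the aggregation, however, stays this simple.&lt;/p>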
&lt;h2 id="audit-and-logs">Audit and Logs&lt;/h2>
&lt;p>Your process will incorporate logging and auditing, whether to meet internal requirements or to support compliance mandates—both internal policies and external regulations.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Manual log scraping is likely ad hoc, with no centralized logging system or SIEM in place. Log aggregation and retention may be inconsistent or not exist across your cloud native stack, including infrastructure, platform, and application layers.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Define a clear log aggregation strategy to ensure comprehensive logging for your production workloads. As you&amp;rsquo;re now in production, capturing and managing logs is critical. Consider implementing long-term archiving to support compliance, troubleshooting, and historical analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Begin enabling audit capabilities and configuring the most critical alerts if you haven’t already. Reduce noise by filtering irrelevant data and ensuring logs are structured for easy retrieval and analysis. Focus on capturing and presenting the most business-critical insights to support informed decision-making.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Audit and alert systems are treated as essential production components and enforced across all applications. You have a defined strategy for indexing, partitioning, and managing large volumes of data, ensuring it is handled with the same level of importance as business-critical information. Regularly test your ability to respond to audit requests on short notice to maintain readiness and compliance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: At maturity, you have a clear understanding of the value of the data you retain. You can now determine what should be summarized for long-term retention and what can be safely deleted. Compliance obligations may also require the disposal of certain data after a set period. Consider leveraging logging data as a source for machine learning and AI to uncover new insights and improve decision-making.&lt;/p>
&lt;/li>
&lt;/ul>
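&lt;p>The Level 3 advice to reduce noise and structure logs for retrieval can be sketched as a severity floor over structured records. Field names and levels below are illustrative, not a prescribed schema.&lt;/p>

```python
# Sketch of noise reduction: keep structured log records at or above a
# severity floor so alerting sees only what matters.
LEVELS = {"debug": 0, "info": 1, "warning": 2, "error": 3}

def filter_logs(records, floor="warning"):
    return [r for r in records if LEVELS[r["level"]] >= LEVELS[floor]]

records = [
    {"level": "debug", "msg": "cache miss"},
    {"level": "error", "msg": "payment gateway timeout"},
    {"level": "info", "msg": "request served"},
]
print(filter_logs(records))  # only the error record survives
```

&lt;p>The same structured records can later be indexed, partitioned, and summarized as the Level 4 and Level 5 stages describe.&lt;/p>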
&lt;h2 id="software-integration-and-release">Software Integration and Release&lt;/h2>
&lt;p>Central to your cloud native transformation is the adoption of a software integration and release process to help you build, test and deploy applications based on modern software development practices.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: You may not have a formal change control process in place. Instead, changes occur on an ad-hoc or informal basis. If you already use CI/CD, it’s essential to adapt and evolve it for your cloud-native environment. This means building on existing best practices while ensuring they align with cloud-native principles.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: For your application, implement structured build and deployment processes that align with cloud-native and container-native integration and release practices. Code quality is improving, as measured by automated tooling, and you are consistently achieving successful CI runs and test validations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: You are establishing a center of excellence around your software integration and release process, focusing on measuring and improving release velocity, defect rates, and deployment cadence. While you have implemented continuous delivery, production deployments still require an approval gate.&lt;/p>
&lt;p>At this stage, you are experimenting with advanced deployment strategies such as blue-green and canary releases to optimize reliability. Defects, hotfixes, and bug fixes are trending downward, reflecting improved stability. Best practices are now firmly in place, and human access to production has been eliminated in favor of GitOps operators, ensuring a more secure and automated deployment process.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: You can now demonstrate the value of your software integration and release process to the organization, showcasing tangible improvements in velocity, deployment speed, and business impact. You may have implemented DORA metrics and will want to map those to business outcomes.&lt;/p>
&lt;p>With optimized delivery methods, new features ship faster, accelerating innovation. Your quality engineering (QE) capability is well-established, ensuring automated deployments to production, where only failed tests can block a release. Additionally, monitoring failures trigger automated responses, such as restarting or managing failing resources, enhancing system resilience and reliability.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: You are actively exploring and investing in innovative strategies—including potential system refactoring, infrastructure optimizations, and runtime or programming language improvements—to enhance both the safety and speed of software integration, testing, and deployment. With a streamlined release process, successful deployments happen efficiently, enabling the business to adapt quickly to evolving market conditions and stay ahead of the competition.&lt;/p>
&lt;/li>
&lt;/ul>
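&lt;p>As a concrete sketch of the canary strategy described above, the following manifest uses Argo Rollouts (part of the CNCF-graduated Argo project) to shift traffic in stages; the application name, image, weights, and pause durations are illustrative placeholders, not a prescribed configuration:&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-app          # hypothetical application name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: registry.example.com/example-app:1.2.0   # placeholder image
  strategy:
    canary:
      steps:
      - setWeight: 20            # send 20% of traffic to the new version
      - pause: {duration: 10m}   # observe metrics before continuing
      - setWeight: 50
      - pause: {duration: 10m}
      # the rollout is fully promoted after the final step succeeds
&lt;/code>&lt;/pre>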
&lt;h2 id="security">Security&lt;/h2>
&lt;p>Integrating security tooling and best practices into your cloud native environment as early as possible is essential for maintaining a strong security posture. The concept of “shifting left” emphasizes embedding security—alongside testing and other critical practices—early in the development lifecycle.&lt;/p>
&lt;p>Security is woven throughout the Cloud Native Maturity Model, with each level playing a role in strengthening an organization&amp;rsquo;s security posture. A comprehensive approach ensures security teams can drive maturity across the software supply chain, platforms, and beyond, recognizing security as a cross-cutting concern that impacts every aspect of cloud-native operations.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Make security a first-class citizen in every aspect of implementation. Understand the unique security challenges across your infrastructure, network, platform, applications, and code, and proactively address vulnerabilities and misconfigurations. Prioritize security from the start—building it in from the beginning is far easier than retrofitting security practices and tooling later.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Build security into your software integration process and runtime environment, including container scanning and configuration scanning. You are extending your existing policies into your cloud native ecosystem.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Implement automatic continuous scanning to flag misconfigurations or security issues.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Ensure security remediation is automated and/or identified automatically with remediation advice.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: The software supply chain is secured, with reproducible builds and software bills of materials providing insight into code and dependencies, with clear code provenance and secured release pipelines. You&amp;rsquo;ve shifted security left. You are preserving security by continuously monitoring Kubernetes for misconfigurations and vulnerabilities.&lt;/p>
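&lt;p>One way to back supply chain controls like these with enforcement is an admission policy that only admits signed images. The sketch below assumes Kyverno (a CNCF incubating project) and cosign-style signatures; the registry pattern and key material are placeholders:&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images    # hypothetical policy name
spec:
  validationFailureAction: Enforce   # reject, rather than just report, violations
  rules:
  - name: verify-signatures
    match:
      any:
      - resources:
          kinds:
          - Pod
    verifyImages:
    - imageReferences:
      - "registry.example.com/*"     # placeholder registry pattern
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              (your cosign public key here)
              -----END PUBLIC KEY-----
&lt;/code>&lt;/pre>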
&lt;/li>
&lt;/ul></description></item><item><title>Aspects: Technology</title><link>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/technology/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/technology/</guid><description>
&lt;h1 id="cloud-native-maturity-model---technology">Cloud Native Maturity Model - Technology&lt;/h1>
&lt;h2 id="navigation">Navigation&lt;/h2>
&lt;p>The Cloud Native Maturity Model is composed of six separate documents - the
&lt;a href="http://./prologue.md" target="_blank">Prologue&lt;/a> and the five key reference documents:&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="http://./people.md" target="_blank">People&lt;/a>&lt;/li>
&lt;li>
&lt;a href="http://./process.md" target="_blank">Process&lt;/a>&lt;/li>
&lt;li>
&lt;a href="http://./policy.md" target="_blank">Policy&lt;/a>&lt;/li>
&lt;li>
&lt;a href="http://./technology.md" target="_blank">Technology&lt;/a>&lt;/li>
&lt;li>
&lt;a href="http://./business_outcomes.md" target="_blank">Business Outcomes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="position-on-included-technologies">Position on Included Technologies&lt;/h2>
&lt;p>&lt;em>&lt;strong>The Cloud Native Maturity Model includes references only to CNCF graduated or incubating projects. The Maturity Model’s default position on CNCF sandbox projects is to exclude them unless they are referenced in later stages of maturity (i.e., by users who have achieved Level 4 or 5). It does not and will not include any reference to commercial software.&lt;/strong>&lt;/em>&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The Cloud Native Maturity Model covers five major dimensions - People, Process, Policy, Technology and Business Outcomes. This paper addresses technology - the practical tooling that makes up cloud native applications, platforms, and infrastructure. As well as referring to specific technologies, this paper aims to show the stages you may go through as you move from starting out all the way through to cloud native excellence.&lt;/p>
&lt;p>This paper illustrates just one path, but all journeys differ. This is absolutely as it should be, as organizations all start out at different points and have different destinations (business outcomes). Different locations, sizes, starting points (greenfield or long established), regulatory environments, and of course people, all influence the cloud native journey.&lt;/p>
&lt;p>The technology section of the Cloud Native Maturity Model is not exhaustive. We would love contributions to ensure the model is robust and useful for all users. All readers are encouraged to submit GitHub PRs with comments and suggestions.&lt;/p>
&lt;h2 id="the-technology-overview">The Technology Overview&lt;/h2>
&lt;p>We anticipate you’ll have a reasonable understanding of why you want to adopt cloud native technology i.e. your expected business outcomes. Clarity on why you wish to undertake this journey to achieving full cloud native adoption is your largest asset. At a high level, the key steps in your cloud native journey will look something like this:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: You’ll have your initial experimentation and adoption of Kubernetes. You’ll start with relatively basic tools and technology. You’ll assess your existing toolset to see how they fit within the new landscape (what plays well with cloud native, and what doesn’t?). You’ll have limited automation, but don’t worry, it’s coming! Your focus is on getting the baseline technology implemented, and you won’t be in production yet.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: This marks your first step into production. You’ve worked hard to build your foundation in Level 1, and now you are moving to production. You might have started with something relatively small and simple, but this leap to production has certainly required you to address some significant steps. You’ll probably have had to incorporate monitoring and observability into your workloads. You’ll have brought key observability tooling in and started monitoring your clusters for standard metrics such as RAM, CPU etc. While you might be starting to evaluate application tracing, don’t worry about it too much if you have started to gather core metrics. Your focus here is on getting an application running in production and having enough platform resource, observability and operational capability to support it within your organization.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Here you start to scale. Your suite of tools is more standardized. You&amp;rsquo;re getting your release tooling, secrets management and policy tooling in place. You’re also starting to get a level of buy-in across your organization, which is helping to propel you forward. This is where you will be running the largest number of tools as you will be in the thick of evaluating, implementing, and running in production.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: You’ve got full control over your environment, and you’ve built your confidence, with rapid adoption of cloud native patterns for new applications and platforms. You’ve also gained organizational commitment to cloud native, and this is adding to your momentum. You’re starting to feel like you’ve “crossed the chasm.”&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Your investment is now focused on automation in functional and non-functional areas such as scanning, policy, security and testing. You’ve got operators doing your operations for you and you’re fully automated.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Please note, technology is changing rapidly with AI developments. Within each section, you should consider how AI can help you improve or streamline actions. There is a huge opportunity with AI and while we want the Cloud Native Maturity Model to cover AI, we defer to the AI community within the CNCF for its expertise. You can read more in the
&lt;a href="https://tag-runtime.cncf.io/wgs/cnaiwg/whitepapers/cloudnativeai/" target="_blank">CNCF AI whitepaper&lt;/a>.&lt;/p>
&lt;h2 id="architecture-and-solution-design">Architecture and Solution Design&lt;/h2>
&lt;p>This section outlines how architecture and solution design evolve from ad hoc development and basic planning to fully automated, policy-driven platforms with codified standards. As maturity increases, organizations align architecture with business goals, optimize for scale and efficiency, and empower developers through self-service and well-defined patterns.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Level 1:&lt;/strong> The focus is on getting workloads into development, without paying much attention to the specific availability requirements of components. The goal is to prepare for production and begin building cloud native capabilities. Capacity planning, such as network address space, should be considered early to avoid significant rework later. It’s tempting to start small when prototyping, but small designs often make their way into production. Invest effort in upfront planning for areas that are difficult to adjust later.&lt;br>
In cloud native environments, traditionally distinct architecture domains are now tightly integrated. Architects must work more closely with core service teams such as networking, security, and storage. From a solution design perspective, begin evaluating how your application might need to change to better support cloud native architecture (e.g., a monolith may not be suitable). Authentication and authorization can be challenging for legacy applications—consider replacing LDAP with OAuth 2.0 or OIDC.&lt;br>
Begin building up the supporting services for your cloud native workloads. Consider the logical and technical interfaces that make up your cloud perimeter, as these heavily influence how workloads are architected and how they integrate with upstream and downstream systems. Avoid boxing yourself into a corner early on to prevent common issues later. When planning your production architecture, consult reference architectures such as the
&lt;a href="https://architecture.cncf.io/" target="_blank">CNCF reference architectures&lt;/a>.&lt;br>
&lt;strong>Level 2:&lt;/strong> You’re now in production. As an architect, you&amp;rsquo;re thinking about the non-functional business requirements of applications, including performance, capacity, availability, disaster recovery, and security. As you implement platforms, consider how they can meet these needs and how they might fail—not only through infrastructure failure but also through logical corruption or misconfiguration. You may conduct exercises to identify failure scenarios and the risks they pose, which will guide engineering decisions.&lt;br>
It’s easier to meet requirements at higher levels of abstraction (e.g., instead of replicating a stateful VM, run multiple pods and load balance across them; use tools like
&lt;a href="https://rook.io/" target="_blank">Rook&lt;/a> or
&lt;a href="https://longhorn.io/" target="_blank">Longhorn&lt;/a> instead of replicating a
&lt;a href="https://en.wikipedia.org/wiki/Storage_area_network" target="_blank">SAN&lt;/a>). Levels 1 and 2 are valuable stages for exploring and learning about the applications transitioning to cloud native and their service requirements, helping reduce operational risk later.&lt;br>
You are starting to establish production patterns that will be reused going forward. Common Terraform modules may be approved for specific workload types, and patterns for tools like secret management are being established. You are standardizing how platforms, services, and applications are built and maintained—using tools like Kustomize at scale, mandating Helm charts, or taking an operator-first approach. You may set standards for container integration (e.g., buildpacks or Containerfiles for everything).&lt;br>
A catalog of basic patterns is beginning to emerge. The focus is on raising the standard across applications, platforms, and infrastructure to ensure high-quality production services. This sets the foundation for scaling in Level 3. Security classification becomes a key non-functional requirement, alongside availability and recovery. For each classification, you should know which applications fall under it and have defined patterns for supporting security components—e.g., encryption policies, artifact placement, Kubernetes secret handling, and key rotation.&lt;br>
You are also starting to develop placement patterns (on-prem, cloud, edge) for infrastructure and platform services such as Kubernetes clusters, and to select right-fit solutions based on these. Even if the goal is full cloud rehosting, technical or policy constraints may require keeping some workloads or services on-prem.&lt;/li>
&lt;li>&lt;strong>Level 3:&lt;/strong> At this level, architectural planning becomes more intentional. You are operating in production and discovering new ways cloud native can deliver business value—such as supporting multiple standards (e.g., object storage), increasing platform agility, and improving cost and availability. Costs may initially rise before falling. New capabilities emerge—for example, handling bursty workloads by using spot nodes, which in turn improves availability.&lt;br>
A subtle shift occurs: tools previously outside the runtime platform are now integrated (e.g., GitOps and its dependency on source control repositories), which raises availability requirements for those tools. It’s essential to understand your dependencies within the cloud native ecosystem.&lt;br>
You may find preferred tools or standards don’t work for all use cases (e.g., the 80/20 rule), especially with specialized workloads that scale horizontally or vertically. These edge cases need their own standards. Rather than letting those teams create them in isolation, extend your portfolio of services and standards to include them. Having multiple teams consume a pattern helps drive up quality over time as feedback about what works and what doesn’t is incorporated.&lt;br>
The growing cloud native ecosystem offers many new capabilities. Emphasis now falls on defining what is acceptable for use and under what conditions—based on non-functional requirements. For example, you might offer multiple placement region options, with tiered requirements for regional versus zonal clusters, depending on application criticality. Creating a capability catalog helps prevent overprovisioning and reduce costs.&lt;br>
Security rules and standards are codified as policies and checked regularly—including within the software supply chain. You have patterns in place for data sovereignty, data residency, and complying with regulations across jurisdictions (e.g., some countries require financial or medical data to stay within their borders). You must understand your provider’s residency guarantees.&lt;br>
At this stage, you&amp;rsquo;re expected to trial new technologies, vendors, and tools. Learn and decide quickly, and clearly document what is and isn’t supported. Think in terms of full technology lifecycles and retire tools to pay down technical debt as soon as possible.&lt;br>
You also begin standardizing workload migration processes and architectural requirements. A triage process may be needed per application to handle networking, IAM, and analysis components. Define a clear policy set for all application workloads during migration. Migrating workloads requires time and architecture guidance. For example: do we refactor? How do we replace authentication solutions previously available on-prem? Reference architectures and implementations help development teams accelerate migration and refactoring.&lt;/li>
&lt;li>&lt;strong>Level 4:&lt;/strong> The capability catalog becomes more refined, narrowing down to a short list of highly expressive options. You can deploy infrastructure, platforms, and applications rapidly—and decommission just as easily. Lifecycle management across infrastructure, platform, and application is in place, with a strong emphasis on automation.&lt;br>
You are far more efficient in resource allocation and take advantage of placement options (e.g., deploying in Iowa instead of the West Coast to save on cost). Interfaces are clean, consistent, and well-defined, coming in multiple forms to suit different needs, all with strong observability, and applications scale effectively within your platform.&lt;br>
Developer experience is excellent, with clearly defined stages, easy-to-use interfaces, and built-in guidance. The platform leads developers toward secure, compliant, and efficient designs. Policy requirements are encoded into the platform and surfaced through portals, helping developers build according to organizational standards.&lt;br>
Solutions design becomes less manual. Policy is baked into the platform, and architectural consulting focuses more on refining and reviewing designs for alignment with business goals. You can spin up entire environments to test new capabilities without impacting production.&lt;br>
As tribal knowledge becomes codified in the platform, it must meet the same security classification standards as the workloads it supports (e.g., applying
&lt;a href="https://en.wikipedia.org/wiki/Hardware_security_module" target="_blank">HSM&lt;/a>s with
&lt;a href="https://openbao.org/" target="_blank">OpenBao&lt;/a> when supporting medical applications with patient data).&lt;/li>
&lt;li>&lt;strong>Level 5:&lt;/strong> Everything is self-service and fully automated. Environments are right-sized, secured, cost-effective, and provisioned or decommissioned on demand.&lt;br>
Developers have access to a mature portfolio of capabilities and services that meet all functional and non-functional requirements. Automation includes AI/ML for detecting issues, suggesting optimizations, and generating pull requests for rapid review. It is easy and safe for developers to get the capabilities they need while being guided toward the right solutions.&lt;br>
Architectural decisions are made with full awareness of their implications and impacts. Design reviews take the form of pull requests, with the entire application stack—including IaC and Kubernetes manifests—rendered in code, accelerating delivery.&lt;/li>
&lt;/ul>
&lt;h2 id="platforms-and-infrastructure">Platforms and Infrastructure&lt;/h2>
&lt;p>For the sake of clarity, we regard any platform or infrastructure used by application developers as production.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: You’re beginning to build your cloud infrastructure, whether on-premises or in the cloud. It’s important to consider foundational technologies early—networking, firewalls, IAM, access controls, and policies—and whether any of these need to change. As you experiment with Kubernetes, keep track of emerging needs and decisions; these will serve as breadcrumbs guiding your journey toward cloud native.&lt;/p>
&lt;p>Expect to address RBAC policies, load balancer or ingress configurations, cluster dashboards, privileged access (or the removal of it), and container logging. Your goal is to move from managing servers as ‘pets’ to treating them as ‘livestock’ by investing in declarative infrastructure with Infrastructure as Code (IaC) tools.&lt;/p>
&lt;p>Platform teams need dedicated engineering clusters to validate and iterate on their infrastructure work. These clusters provide the flexibility to build, test, and tear down resources manually as needed. However, any cluster intended for application developers—including those labeled as &amp;ldquo;development&amp;rdquo; or &amp;ldquo;integration&amp;rdquo;—must be treated as production-grade. These environments should be provisioned using your strategic IaC tooling (e.g., OpenTofu), with versioned modules that ensure consistency across dev, test, and production. For example, if deploying GKE with specific NIC/subnet settings, the module should be versioned and reused across all application environments to ensure predictability and compliance.&lt;/p>
&lt;p>If a consolidated DevOps practice isn’t yet in place, involve your future operations team now to build familiarity and alignment. For cloud service providers, you’ll also need to consider regions, encryption key management, and integration with your corporate network.&lt;/p>
&lt;p>Regardless of environment, version-control everything and adopt IaC to manage configurations. You’ll be deploying frequently and must be able to tear down environments just as easily without leaving behind configuration drift or cruft (leftover technical detail, such as stray files or configuration).&lt;/p>
&lt;p>You’re also beginning to shape your deployment and operating models. Key questions will emerge: Does the same team own both the app and the cluster? How are clusters provisioned? How are applications onboarded? Are you adopting models like namespace-as-a-service or cluster-as-a-service? Will there be a shared responsibility model for the clusters?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: At this stage, you’re focused on managing configuration consistently across your environment. You’re moving away from imperative practices, such as shell scripts or “ClickOps” (manual configuration through web UIs), and any remaining use of these approaches is tightly controlled, documented, and versioned in source control (e.g., Git).&lt;/p>
&lt;p>You now have solid solution architecture in place to define key requirements. The Kubernetes cluster is central to your operations, and you’re successfully meeting non-functional requirements such as disaster recovery (DR), high availability (HA), and secure access through properly configured IAM and RBAC. Networking is aligned with application needs, and infrastructure capacity planning is actively balancing cost and performance. For stateful applications, selecting appropriate storage classes and volume types is critical.&lt;/p>
&lt;p>This is a good time to consider adopting
&lt;a href="https://opengitops.dev/" target="_blank">GitOps tools&lt;/a> or managed services to initialize and maintain Kubernetes clusters, especially for setting up core components like ingress controllers or cluster-scoped operators. While these investments may carry upfront costs, they pay long-term dividends in consistency, security, and scalability.&lt;/p>
&lt;p>You’ll also need to revisit decisions made in Level 1—some of which were valid at the time but may not scale. Paying down infrastructure technical debt is essential at this point to avoid bottlenecks later. Consistency is key to supporting broader scale as you mature.&lt;/p>
&lt;p>Wherever possible, leverage the
&lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/" target="_blank">operator pattern&lt;/a> for deploying and managing third-party tools. Operators ensure consistency and lifecycle management in contrast to ad hoc, manual installations.&lt;/p>
&lt;p>Your deployment and operating models are now established and delivering value to the business. However, they may be overly tailored to your initial workloads, requiring either refinement or reevaluation to accommodate new teams or applications. Your chosen operating model—whether centralized or federated—will significantly influence technical implementation. For example, centralization may point toward models such as namespace-as-a-service.&lt;/p>
&lt;p>At Level 2, given the production status of the workload, it is worth considering your overall platform strategy. There are architecture aspects to this, as well as engineering. Review the
&lt;a href="https://tag-app-delivery.cncf.io/whitepapers/platforms/" target="_blank">Platforms White Paper&lt;/a> from the CNCF as well as the
&lt;a href="https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/" target="_blank">Platform Maturity Model&lt;/a>. As the white paper explains, “a platform is an integrated collection of capabilities defined and presented according to the needs of the platform’s users.” The paper also explains that there are consistent user experiences for managing the platform’s capabilities such as web portals, project templates and self-service APIs. As such, consider that your organization does already have a set of capabilities that together provide developers the ability to get business functionality into production; however given the likely unintegrated nature of them, with potentially different user experiences with different tooling, you’ll want to consider creating an internal developer platform (IDP). At level 2, the key aim will be the
&lt;a href="https://tag-app-delivery.cncf.io/wgs/platforms/glossary/#thinnest-viable-platform-tvp" target="_blank">thinnest viable platform&lt;/a>,&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: At this stage, you&amp;rsquo;re building confidence in your infrastructure by gaining deep visibility into how it&amp;rsquo;s operating. Monitoring, alerting, and resource usage tracking become top priorities—not just at the node level (CPU, memory, etc.) but also across the entire cluster. You&amp;rsquo;re evolving from simply running infrastructure to managing it proactively.&lt;/p>
&lt;p>Whereas in earlier stages you may have remediated failing components manually, now you&amp;rsquo;re replacing and redeploying them automatically. You&amp;rsquo;re beginning to manage infrastructure like software, leveraging Kubernetes as the control plane for elasticity and self-healing behavior. This means offloading more responsibility to the cluster itself through mechanisms like horizontal and vertical pod autoscalers, as well as application-focused autoscalers such as KEDA.&lt;/p>
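&lt;p>As a sketch of offloading scaling to the cluster, the following hypothetical KEDA ScaledObject scales a Deployment based on a Prometheus query; the Deployment name, server address, query, and threshold are illustrative assumptions:&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: worker             # the Deployment to scale (assumed to exist)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus         # scale on observed request rate
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
      query: sum(rate(http_requests_total{app="worker"}[2m]))
      threshold: "100"       # add a replica per 100 req/s above baseline
&lt;/code>&lt;/pre>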
&lt;p>Advanced Kubernetes scheduling practices come into play, including the use of priority classes, quality of service tiers, affinity rules, and TopologySpreadConstraints. These capabilities help manage resource allocation and eviction behavior, which are critical in elastic, multi-tenant environments where infrastructure directly impacts user experience.&lt;/p>
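&lt;p>These scheduling controls live in the pod specification. The fragment below, with illustrative names, spreads replicas across zones and assumes a pre-existing PriorityClass; the resource requests and limits also determine the pod’s quality-of-service class and eviction order:&lt;/p>
&lt;pre>&lt;code class="language-yaml"># Fragment of a Deployment pod template (names are illustrative)
spec:
  priorityClassName: business-critical     # assumes this PriorityClass exists
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule       # hard requirement: spread across zones
    labelSelector:
      matchLabels:
        app: example-app
  containers:
  - name: example-app
    image: registry.example.com/example-app:1.2.0   # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        memory: 256Mi
&lt;/code>&lt;/pre>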
&lt;p>Your infrastructure must now support a more sophisticated software delivery lifecycle. This includes provisioning sandbox environments for application developers, platform engineers, and infrastructure teams to experiment and validate changes safely. Your deployment and operating models become more flexible to support a wider range of workload types and team needs.&lt;/p>
&lt;p>You’re beginning to collect and act on platform usage data to inform architectural decisions. For example, if 95% of teams operate effectively within a namespace-as-a-service model, you may double down on optimizing that offering. If others require dedicated clusters, you’ll need to decide whether to offer cluster-as-a-service or guide them toward scalable alternatives. These decisions are costly to reverse, so it&amp;rsquo;s essential to balance strategic vision with measured data from real usage.&lt;/p>
&lt;p>Infrastructure becomes harder to change as it scales—so flexibility must increasingly come from the platform and application layers rather than the infrastructure itself. To support this, you&amp;rsquo;ll need to clearly communicate platform capabilities, limitations, and upcoming evolutions. You’ll also begin to prioritize investments based on both customer needs and cost-awareness, recognizing that in a cloud environment, every resource consumed has a price.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: By this stage, Kubernetes and its API are second nature. You’ve matured your infrastructure practices and Infrastructure as Code (IaC) tooling, and you&amp;rsquo;re likely exploring ClusterAPI to automate the deployment and full lifecycle management of your clusters.&lt;/p>
&lt;p>As platform maturity increases, so does the need for refined control and governance. You’re now implementing policies across the infrastructure control plane and related controllers to ensure consistent behavior, security, and compliance at scale.&lt;/p>
&lt;p>Your deployment and operating models have been further refined, recognizing that no single approach fits every team. Earlier solutions—like namespace-as-a-service—were effective for rapid onboarding in lower-risk environments. Now, with more sophisticated needs across development and product teams, you&amp;rsquo;re supporting a broader range of workloads and operator-based services such as storage, databases, messaging, and logging.&lt;/p>
&lt;p>This introduces more complexity and risk with each change to your operating model. Development teams are making more demands, and your platform must be flexible without sacrificing stability. To meet this need, you’re investing in centrally maintained Infrastructure as Code templates (e.g., using
&lt;a href="https://opentofu.org/" target="_blank">OpenTofu&lt;/a>), while still allowing for team-specific customization such as networking or other infrastructure components.&lt;/p>
&lt;p>The challenge at this level is to plan and support a diverse set of workloads across the enterprise—from lightweight APIs to complex, large-scale systems with demanding functional and non-functional requirements. Flexibility, consistency, and strong governance become the pillars of your infrastructure strategy as you scale to meet the needs of a growing and diverse engineering organization.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: At this stage, your entire infrastructure lifecycle—provisioning, upgrades, and decommissioning—is fully managed through software and codified processes. Every infrastructure change is made via code, enabling repeatability, auditability, and rapid iteration.&lt;br>
You are maximizing efficiency and minimizing technical debt. Infrastructure is highly flexible, cost-optimized, and aligned with business needs, supported by robust self-service interfaces that empower teams without compromising control. FinOps practices are well-integrated, ensuring that both resource utilization and cost are continuously optimized.&lt;br>
Tight GitOps-based control loops are in place across the entire infrastructure, eliminating configuration drift and enforcing consistency. Everything is version-controlled, declaratively defined, and managed as part of a cohesive system.&lt;br>
Shared or pooled costs are minimized or fully allocated. You have clear, effective mechanisms for cost management and chargeback, ensuring that infrastructure expenses are transparent and aligned with team or business unit usage.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="application-patterns-and-refactoring">Application Patterns and Refactoring&lt;/h2>
&lt;p>As you begin your cloud native journey, start with a small, manageable application—ideally a stateless, greenfield microservice. This will help you validate the fundamentals: Kubernetes access (kubectl), networking, platform capabilities, CI/CD processes, and initial security patterns. It also provides an opportunity to define application architecture standards, deployment templates, and policy guardrails that can scale across future projects.&lt;/p>
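&lt;p>As a concrete starting point, such a first stateless service often needs little more than a Deployment and a Service. The sketch below is illustrative; the names, image, and port values are placeholders rather than prescriptions.&lt;/p>
&lt;pre>&lt;code class="language-yaml"># A minimal stateless microservice: a Deployment plus a ClusterIP Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-api                # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-api
  template:
    metadata:
      labels:
        app: hello-api
    spec:
      containers:
        - name: hello-api
          image: registry.example.com/hello-api:1.0.0  # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: hello-api
spec:
  selector:
    app: hello-api
  ports:
    - port: 80
      targetPort: 8080
&lt;/code>&lt;/pre>
&lt;p>Applying these manifests with &lt;code>kubectl apply -f&lt;/code> exercises exactly the fundamentals listed above: API access, networking, and the deployment flow.&lt;/p>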
&lt;p>While microservices are a common starting point, Kubernetes is increasingly used as a general-purpose runtime—meaning monoliths may still exist or even be newly developed, depending on business needs. It’s important to define the differences between microservices and monoliths early, and understand how microservices typically align better with Kubernetes’ strengths around scalability, independent deployment, and resilience. Regardless of architecture, focus on patterns that enable gradual modernization—such as the strangler pattern—which can guide future refactoring efforts.&lt;/p>
&lt;p>Early cloud native adoption should focus on simple use cases that expose platform and process gaps quickly. Application and platform teams should work closely together at the start, ideally with cross-functional teams (e.g., developers paired with cloud specialists), to accelerate learning and reduce duplication of effort. Over time, these functions will split as platforms mature and can support broader reuse.&lt;/p>
&lt;p>You’ll likely be dealing with technical debt, and some applications will depend on third-party services with their own roadmaps. These realities should inform your prioritization. Application teams must also become familiar with the cloud native services and capabilities offered by the platform team and may need to collaborate with platform product managers to shape future needs.&lt;/p>
&lt;p>These early applications will serve as blueprints for broader adoption.&lt;/p>
&lt;p>Here is a working model for the microservices path. You may adapt it to your own context.&lt;/p>
&lt;ul>
&lt;li>Level 1: Begin by reviewing microservice patterns and architecture in the context of your specific applications. Non-functional requirements—such as latency, resilience, scalability, and third-party integrations—must be carefully considered. For example, splitting logic across pods can introduce latency (particularly across datacenters or regions), so it’s critical to evaluate architectural patterns early.&lt;br>
If you&amp;rsquo;re refactoring a monolith, expect significant redesign. Existing architectures may not have the technical scaffolding to support cloud native patterns. State management is a key concern—refactoring may require substantial changes here. This process should reinforce the understanding that moving to cloud native is a long-term commitment.&lt;br>
Cloud native platforms offer abstractions and capabilities that were previously hard to implement. With this flexibility comes the need to understand the quality and trade-offs of various infrastructure components—for example, object vs. block storage, or the selection of container network interfaces (
&lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/" target="_blank">CNIs&lt;/a>) or
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/#the-kubernetes-network-model" target="_blank">networking&lt;/a> resources within Kubernetes itself: ingress controllers, and service meshes like the Gateway API. Gaining a comprehensive view of available options is essential as you refactor for Kubernetes.&lt;br>
A shift to declarative models introduces new non-functional requirements for application teams. One fundamental change is the ephemeral nature of infrastructure—developers must now design applications assuming that no single instance will persist. Availability responsibilities move from infrastructure to application, requiring developers to build resilience into the code.&lt;br>
Instead of depending on individual pods, applications must be managed via higher-level Kubernetes resources like Deployments or StatefulSets. This lowers infrastructure costs but increases developer accountability for reliability and availability.&lt;br>
Pods, by nature, are ephemeral. This affects caching strategies—developers must either implement persistent caching or use persistent volumes where necessary. Because applications will sit behind ingress controllers or load balancers, readiness and liveness probes become critical. End-to-end readiness checks—such as backend transaction validation—ensure only functional services are exposed to users.&lt;br>
Developers must also understand pod-to-container relationships and models like sidecars, which help separate concerns. IP addresses are dynamic, so service discovery must rely on DNS, the Kubernetes API or other appropriate means.&lt;/li>
&lt;li>Level 2: You’re in production now, and the focus shifts to scale, availability, observability, and alignment between your platform and applications. You may be introducing service meshes and more advanced monitoring.&lt;br>
If you’re adopting GitOps, developers need a clear understanding of its key principles and how to get started. Irrespective of this decision, they should begin using Kubernetes-native configuration management tools. This includes externalizing configuration using ConfigMaps, Secrets, or other runtime mechanisms—rather than embedding configuration in the image at build time. This approach improves validation and reduces drift, making practices like &lt;code>git diff&lt;/code> effective for tracking changes.&lt;br>
Because the platform is software, it requires regular maintenance. Kubernetes releases
&lt;a href="https://kubernetes.io/releases/release/#the-release-cycle" target="_blank">approximately three times per year&lt;/a>, so establishing a proactive cluster lifecycle and maintenance process is critical. Regular updates should be scheduled as part of ongoing operations—not treated as exceptional events.&lt;br>
From the start, developers must understand that pods are ephemeral. They should account for this by implementing node and pod affinities,
&lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/" target="_blank">PodDisruptionBudgets&lt;/a>, and
&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/" target="_blank">TopologySpreadConstraints&lt;/a> to ensure service continuity during cluster upgrades and disruptions.&lt;/li>
&lt;li>Level 3: By Level 2, application patterns were well defined, with a strong push for consistency in foundational practices. At Level 3, you begin expanding beyond the basics and may encounter the limitations of your existing tooling.&lt;br>
Applications are refactored to better align with platform-native resource types—such as using object storage instead of persistent volumes—and to adopt operator-first patterns for lifecycle management. Where developers previously worked within a namespace-as-a-service model, they may now explore cluster-as-a-service options to gain greater isolation or flexibility.&lt;br>
You may introduce abstraction layers like
&lt;a href="https://dapr.io/" target="_blank">Dapr&lt;/a> to decouple infrastructure services (e.g., messaging, storage) from application code, simplifying development and improving portability. Kubernetes is no longer just an infrastructure platform—it’s evolving into a true application hosting platform, a foundation for your internal PaaS.&lt;br>
New application patterns are emerging, while older, less scalable ones are phased out. These shifts are guided by your evolving
&lt;a href="https://en.wikipedia.org/wiki/Non-functional_requirement" target="_blank">non-functional requirements&lt;/a> and the capabilities of the underlying platform.&lt;/li>
&lt;li>Level 4: At this stage, cloud native
&lt;a href="https://en.wikipedia.org/wiki/Software_design_pattern" target="_blank">design patterns&lt;/a> are formalized and shared across the organization—often documented in Git repositories or collaboration tools like Confluence. Consistency in implementation is not only visible but may now be actively enforced.&lt;br>
The organization strikes a balance between standardization and flexibility. While standardization is ideal for scale and maintainability, some variation remains to support the “right tool for the job” approach. This balance directly influences application architecture and development patterns.&lt;br>
By this point, the organization is converging on a well-defined set of tools and practices that align with both platform capabilities and business needs.&lt;/li>
&lt;li>Level 5: At this level, all new greenfield applications are developed with a cloud native-first approach—unless specific requirements (such as ultra-low latency) dictate otherwise. You’re actively onboarding your existing application portfolio to the platform using proven, repeatable processes.&lt;br>
Applications are now fully aligned with platform strengths and capabilities. You’ve arrived at the “right tool for the job” through an organic, Darwinian evolution—where scalable, resilient services have emerged as the standard. Infrastructure providers no longer constrain your design decisions.&lt;br>
Mature, well-matched application patterns are in place, supported by robust self-service APIs across both the platform and cloud native tooling. These APIs offer a wide range of capabilities that enable teams to operate efficiently at scale.&lt;br>
At this level, you&amp;rsquo;re harnessing the full power of cloud native and the elasticity of cloud infrastructure. For many organizations, this ability to scale seamlessly can have a direct and significant impact on cash flow, operational efficiency, and long-term viability.&lt;/li>
&lt;/ul>
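&lt;p>The disruption-handling guidance above (Level 2) can be sketched in manifest form. Assume a hypothetical &lt;code>checkout&lt;/code> service; the numbers are illustrative and should be tuned to your availability targets.&lt;/p>
&lt;pre>&lt;code class="language-yaml"># Keep at least two checkout pods available during voluntary disruptions
# such as node drains triggered by cluster upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout
---
# Spread replicas across zones so a single zone failure does not take
# down the whole service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 4
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: checkout
      containers:
        - name: checkout
          image: registry.example.com/checkout:2.3.1  # placeholder image
&lt;/code>&lt;/pre>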
&lt;h2 id="container-and-runtime-management">Container and Runtime Management&lt;/h2>
&lt;ul>
&lt;li>Level 1: At this stage, the focus is on learning to build and run containers. Teams must upskill in writing container files, building images, running them in clusters, and understanding how containers differ from virtual machines. Developers also need to work comfortably with containers on their local machines.&lt;br>
A platform team is established to build and manage Kubernetes clusters. Developers begin using development and integration tooling (e.g., Tekton), while infrastructure teams introduce Infrastructure as Code (e.g., OpenTofu) to provision cloud environments, including projects, VPCs, IAM, and Kubernetes infrastructure.&lt;br>
To prepare for production, container image builds are integrated into CI pipelines, and a container registry is adopted with clear versioning and tagging practices. Kubernetes is now the application runtime platform, and you are gathering all necessary deployment artifacts such as YAML files for Deployments, StatefulSets, Services, Ingress, LoadBalancers, PersistentVolumes, and more.&lt;br>
You likely begin using Helm charts (e.g., for Ingress-NGINX) and deploy your first operators for core functionality such as secrets management. Understanding the Kubernetes operator model and custom resources (CRDs) becomes highly valuable.&lt;/li>
&lt;li>Level 2: Now running in production, you begin augmenting the basics with tools for security, policy enforcement, and workload configuration. You establish practices around container hygiene and begin defining policies for base images and dependency management. This may include:&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>
&lt;p>Using standardized base images (e.g., UBI or internal Ubuntu mirrors)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Allowing upstream images from sources like Docker Hub or
&lt;a href="http://Quay.io" target="_blank">Quay.io&lt;/a> with SBOM validation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Maintaining a catalog of hardened, source-built images&lt;/p>
&lt;p>Security practices include automated scanning, runtime observability, and policy controls. CNCF projects become strong candidates to support observability and governance requirements.&lt;/p>
&lt;/li>
&lt;/ul>
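&lt;p>Policies like the base-image and registry rules above are commonly enforced with a policy engine. The sketch below uses Kyverno, one CNCF option among several; &lt;code>registry.example.com&lt;/code> is a placeholder for your approved registry.&lt;/p>
&lt;pre>&lt;code class="language-yaml"># Illustrative Kyverno ClusterPolicy: only admit pods whose images are
# pulled from the organization's approved registry.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: allowed-registries
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be pulled from the approved registry."
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"
&lt;/code>&lt;/pre>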
&lt;ul>
&lt;li>
&lt;p>Level 3: As your workloads grow and you scale operations, consistent tooling across clusters becomes essential for maintaining visibility into your Kubernetes environments. Tools like Karmada enable multi-cluster configuration management, helping ensure consistency and reducing configuration drift—even across regions or continents.&lt;/p>
&lt;p>Namespace as a Service models are common, but maintaining efficiency becomes more challenging. Cost management becomes critical—when infrastructure scales by a factor of 10, even small inefficiencies (e.g., 80% vs. 90% utilization) have a large impact.&lt;/p>
&lt;p>Release management complexity increases. You need to answer questions like: How do we support multiple clusters with a tool like Argo CD? Network planning becomes more important, particularly around IP address management and API server quotas. Even small clusters can place heavy load on Kubernetes API servers, so it&amp;rsquo;s vital to monitor quota limits with your cloud provider unless you are hosting clusters on premises.&lt;/p>
&lt;p>Managing custom resources and operators also becomes a significant operational concern. You must define and document shared responsibilities across infrastructure, platform, and application teams. Operational processes such as cluster upgrades must be well-thought-out and follow best practices—like managing all configuration in source control.&lt;/p>
&lt;p>Observability expands dramatically. The volume of metrics, logs, and traces increases with scale. You need to archive logs effectively and, if you audit system calls, decide how that data will be retained and reviewed. Certificate management and encryption (including key rotation) become critical, often necessitating service meshes for automatic certificate issuance and mTLS. Manual processes won’t scale—automation becomes mandatory.&lt;/p>
&lt;p>You will likely be exploring different techniques and operators and discovering their limitations depending on your deployment model. Centralized tooling can become a bottleneck—for example, pulling artifacts from a single repository or funneling logs to a single destination. Distributed control may be necessary to avoid these constraints, such as using multiple Fluent Bit instances to aggregate logs instead of overwhelming a central log ingestion endpoint.&lt;/p>
&lt;p>You begin to weigh tradeoffs between centralized and distributed models, each offering different benefits and challenges. Questions around maintaining application artifacts and dependencies at scale also emerge—you may be managing tens or hundreds of thousands of containers that require updates and patching.&lt;/p>
&lt;p>Readiness and liveness probes take on new importance at scale. Readiness probes should validate end-to-end functionality, including downstream dependencies, to ensure the pod is truly ready for traffic. Liveness probes help maintain workload availability. As always, workloads must be treated as cattle, not pets.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: By Level 4, the platform vision and architecture are clearly defined. Experiments and ad hoc tooling from Level 3 are rationalized and replaced with strategic, standardized solutions. You may still have overlapping tools in use, but now you have a better understanding of their tradeoffs and suitability. Some of this clarity results from hard lessons and suboptimal solutions adopted under pressure at earlier levels.&lt;/p>
&lt;p>Technical debt from Level 3 is acknowledged, and although paying it down is difficult, it now becomes a focused effort.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: At the highest level of maturity, your platform responds to events automatically. All security and operational data is centralized, allowing for coordinated and efficient action. The system is no longer reactive but proactive. Automated event responses, centralized observability, and full lifecycle automation make the container runtime environment robust, scalable, and secure.&lt;/p>
&lt;/li>
&lt;/ul>
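&lt;p>In a Namespace as a Service model, per-namespace quotas are one of the basic levers for the utilization and cost concerns described at Level 3. A minimal sketch, assuming a hypothetical &lt;code>team-a&lt;/code> namespace with illustrative limits:&lt;/p>
&lt;pre>&lt;code class="language-yaml"># Cap the aggregate resources a tenant namespace may request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a              # placeholder tenant namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
&lt;/code>&lt;/pre>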
&lt;h2 id="application-release-and-operations">Application Release and Operations&lt;/h2>
&lt;p>Managing a cluster with Infrastructure as Code (IaC) differs from managing application release and deployment, though many of the same techniques and tools apply to both.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: When starting with Kubernetes, it is important to gain as much hands-on experience as possible. You will become familiar with the Kubernetes API, write YAML manifests, and begin exploring configuration management and templating tools. Early on, you’ll be supporting dev, test, and prod environments, so it’s critical to evaluate tooling that can help you manage your manifests effectively from the start.&lt;/p>
&lt;p>Version control systems—traditionally used by developers—become essential for release and operations in the cloud native world. Since releases are defined in code, they must be tracked, forming the basis for GitOps. This requires careful evaluation of branching models that align with your organization’s release policies. Ensure you understand the capabilities of your GitOps tooling and your release process. Remember the developer principle of “Don’t Repeat Yourself.”&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: You are now using GitOps operators for rapid, consistent deployment across all environments. Controlling access to configuration repositories and ensuring they reflect what is deployed is critical. You are consuming Helm charts and upstream package artifacts to configure third-party tools.&lt;br>
Supply chain security techniques are being incorporated into your releases (see the Security and Policy section). Observability is now vital to operating cloud native applications. With increased flexibility in environment creation, new release paradigms—such as maintaining two production environments to allow upgrades at any time—can offer significant benefits.&lt;br>
You are adopting a “roll forward” approach to issue remediation, applying the last known good configuration to the cluster. Using resources as feature flags (e.g., with
&lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" target="_blank">Kustomize&lt;/a>) is an effective strategy for testing upgrades. Application teams are expected to maintain all components and third-party dependencies in their solutions.&lt;br>
If development and operations remain separate, there must be an approval process for promoting changes to production, with operations reviewing each release.
&lt;a href="https://www.cisa.gov/topics/cyber-threats-and-advisories/sbom/sbomresourceslibrary" target="_blank">SBOM&lt;/a>s are required for any third-party applications, possibly as a contractual or licensing requirement. It is essential to follow strong security practices for both container images and Kubernetes deployments, and to document and share these practices. Robust configuration management accelerates testing and security patching.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Developers are now responsible for their own releases, including developing their own continuous deployment pipelines. They use the same deployment process for dev, test, and production environments. You are placing greater emphasis on provenance and controlling what enters your clusters (see Security and Policy section).&lt;br>
Instrumentation is robust and includes tracing, observability, service meshes, and mutual TLS. Awareness of cloud provider offerings increases, and performance becomes a key concern. You must now balance performance and cost.&lt;br>
Sharing learning across the organization is essential to avoid perpetuating inefficient practices and technical debt. Upstream third-party tools often release updates as quickly as your internal platform, making it important to stay current with new versions, features, and best practices. As you scale, capacity limits in your initial tooling may surface, making capacity planning and deeper tooling utilization critical.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: You are actively securing the supply chain, and policies now govern both the release pipeline and runtime state. The Kubernetes API is used not only for container orchestration but also to manage other data center components, likely through extensions such as
&lt;a href="https://www.crossplane.io/" target="_blank">Crossplane&lt;/a> and other technologies.&lt;br>
Organizations at this level can create and destroy production-ready clusters on demand and take advantage of beta and alpha APIs. Releases are automated, reliable, consistent, measurable, auditable, revertable, and quick. Artifacts are standardized and predictable.&lt;br>
Automation—including
&lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/" target="_blank">admission controllers&lt;/a>—is used to validate workloads before production release. Security and policy controllers enforce defense-in-depth strategies. Kubernetes becomes a foundational platform component, and developers begin coding against infrastructure capabilities. Tools like
&lt;a href="https://www.crossplane.io/" target="_blank">Crossplane&lt;/a> illustrate this evolution by integrating cloud and infrastructure lifecycles directly into applications.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: Code is released at the highest level of abstraction and with
&lt;a href="https://en.wikipedia.org/wiki/Idempotence" target="_blank">idempotence&lt;/a> from the underlying infrastructure, enabling maximum velocity and minimal vendor lock-in. You are effectively managing controlling artifacts and addressing technical debt.&lt;/p>
&lt;p>Continuous deployment to production is now in place, supported by a fast, controlled, and automated release pipeline.&lt;/p>
&lt;/li>
&lt;/ul>
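&lt;p>As one sketch of the GitOps-driven release flow described above, an Argo CD Application can keep a cluster continuously reconciled against a Git repository. The repository URL, paths, and names below are placeholders.&lt;/p>
&lt;pre>&lt;code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/app-config.git  # placeholder repo
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
&lt;/code>&lt;/pre>
&lt;p>With &lt;code>selfHeal&lt;/code> enabled, manual changes in the cluster are reverted to the state declared in Git, eliminating configuration drift.&lt;/p>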
&lt;h2 id="testing-and-issue-detection">Testing and Issue Detection&lt;/h2>
&lt;p>Testing and issue detection evolve significantly as organizations adopt cloud native practices. This section outlines how testing, observability, and operational readiness mature across each level, from manual validation to automated recovery and continuous validation.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: When just starting out, most testing is manual, focused on your initial production candidate application. With Kubernetes, your attention is on basic network connectivity and confirming that applications can be successfully deployed. You will perform smoke tests and user acceptance testing (UAT).&lt;br>
In Levels 1 and 2, the emphasis is on consistency in container image builds and the continuous delivery process. You’ll rely on existing tools for unit testing and static code analysis. Both functional requirements (e.g., application logic) and non-functional requirements (e.g., performance, capacity, and availability) must be considered.&lt;br>
You’ll begin implementing liveness and readiness probes and incorporating observability tools. It&amp;rsquo;s important to define clear service level agreements (SLAs) based on business and customer expectations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: Now that you&amp;rsquo;re in production, you’ll begin experimenting with tools that support security, policy enforcement, misconfiguration detection, resource management, and observability—starting in staging or development environments.&lt;/p>
&lt;p>Development teams are supported by platform and infrastructure teams for environment management. Tooling decisions should prioritize customer-impacting functionality. For example, don’t focus on low-priority policy controls if customers are experiencing latency issues that could be addressed by monitoring through a
&lt;a href="https://www.cncf.io/blog/2021/07/15/networking-with-a-service-mesh-use-cases-best-practices-and-comparison-of-top-mesh-options/" target="_blank">service mesh&lt;/a>.&lt;/p>
&lt;p>You’re actively prioritizing based on business needs and customer satisfaction. Production feedback becomes a valuable source of insight. Metrics should be tracked and visualized from both platform and application sources. While logging may be challenging at this stage, it is essential for effective troubleshooting. You may also start evaluating tracing tools.&lt;/p>
&lt;p>This phase can be both exciting and challenging, as team members gain production experience at different speeds. Consistent deployments make testing easier, and a strong release pipeline improves issue remediation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Building on your tool and process experimentation, you now implement these practices in production. You establish robust alerting and dashboards, expanding your observability capabilities. Consistency in builds and deployments supports reproducible testing.&lt;/p>
&lt;p>Development and test environments become shorter-lived and are spun up or torn down based on business requirements, following consistent specifications. These environments may be entire clusters or, for cost efficiency, isolated namespaces within a single cluster. You&amp;rsquo;ll begin investing in user interfaces that allow developers to create and destroy environments easily.&lt;/p>
&lt;p>To manage the overhead of dynamic environments, automation is key. You are scaling your build pipelines, leveraging platform capabilities such as cost-efficient regional placement, resource management, and parallelization to increase build concurrency and support more comprehensive testing.&lt;/p>
&lt;p>Vulnerability scanning at the container and cluster level, along with SBOMs, enables effective identification of dependency-related security issues. You’re also becoming more aware of the tradeoff between developer velocity and test coverage, including how frequently and which types of tests are run.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: Issues may now span multiple applications, requiring you to aggregate data across systems to identify trends. You’ve established consistent patterns for validating all environments—including production—and can create and destroy them with ease.&lt;br>
This includes platform considerations like load balancing and secret management to ensure environments feel seamless and plug-and-play. Data consistency across environment instances becomes increasingly important.&lt;br>
Recovery processes are now integrated into standard operations, including chaos engineering to simulate failures and validate system resilience. This helps ensure non-functional requirements like availability and fault tolerance are continuously met in practice.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: At this level, both platforms and applications recover automatically and immediately. Restoration to a known good state is always possible and predictable. You have strong test coverage for both functionality and quality of service, and testing occurs as early in the development lifecycle as possible. Immutability and idempotency principles ensure systems can consistently return to a reliable state.&lt;/p>
&lt;/li>
&lt;/ul>
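&lt;p>The liveness and readiness probes introduced at Level 1 might look like the excerpt below. The endpoints are hypothetical; the &lt;code>/ready&lt;/code> handler is assumed to validate downstream dependencies end to end, as recommended above.&lt;/p>
&lt;pre>&lt;code class="language-yaml"># Readiness gates traffic to the pod; liveness restarts a hung container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders                   # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.4.0  # placeholder image
          readinessProbe:
            httpGet:
              path: /ready       # assumed to check backend dependencies
              port: 8080
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
&lt;/code>&lt;/pre>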
&lt;h2 id="security-and-policy">Security and Policy&lt;/h2>
&lt;p>This section outlines how security and policy practices mature alongside cloud native adoption, starting with basic IAM and secret management and evolving toward automated, policy-driven platforms. As organizations progress, they implement defense-in-depth strategies, enforce compliance through policy as code, and continuously optimize security in response to changing threats.&lt;/p>
&lt;ul>
&lt;li>Level 1: Begin building your secured CI/CD pipeline if you haven’t already, and remember that your current practices with VMs will evolve significantly. You have developed an Identity Provider and Identity and Access Management (IAM) infrastructure, integrating it into your clusters using tools such as RBAC and service accounts.&lt;br>
Following the
&lt;a href="https://12factor.net/" target="_blank">12-Factor&lt;/a> principles, configuration is stored in the environment—including secrets, which are base64-encoded (not encrypted)—to allow stage-specific configurations (e.g., dev, test, prod). Much more can be stored in the environment, enabling immutable images and a strong separation between application and configuration. Avoid embedding environment-specific information, such as credentials, directly in container images.&lt;br>
You are becoming familiar with the Kubernetes API and are aware of its users. You also understand Kubernetes’ flat networking model, where all pods can connect to each other by default, with no inherent workload isolation.&lt;/li>
&lt;li>Level 2: Ensure that development and operations teams follow best practices for container, secrets, and security management. In production, you must address encryption, authentication, and authorization. This includes certificate management and a functioning CA infrastructure that can issue certificates to running pods.&lt;br>
You are implementing secret management tools and automation. TLS or mutual TLS is being deployed at the cluster level and between pods—especially for sensitive workloads. A service mesh may be considered to enhance traffic visibility and manage network security features.&lt;br>
Your systems are auditable, with logs and events captured and retained. Generic accounts that cannot be traced to individuals (e.g., “administrator” or “kubeadmin”) are not used by people; service accounts remain reserved for software accessing resources.&lt;br>
You may limit service exposure to load balancers and restrict network access to the production cluster to prevent unnecessary or unexpected exposure. These measures are especially important in
&lt;a href="https://kubernetes.io/docs/concepts/security/multi-tenancy/" target="_blank">multi-tenant clusters&lt;/a> (e.g.,
&lt;a href="https://kubernetes.io/docs/concepts/security/multi-tenancy/#namespaces" target="_blank">Namespace as a Service&lt;/a>).&lt;br>
Access policies are expanding to include source control, automation components, and dependencies used for managing clusters. You are extending GitOps to the platform layer, ensuring consistency and convergence to a known state, and reverting any drift—whether malicious or accidental.&lt;br>
You are beginning to restrict API access, and validating incoming requests using admission controllers, possibly including mutating admission controllers.&lt;/li>
&lt;li>Level 3: Now is the time to automate deployment guardrails and platform components like certificate management while implementing security best practices through policy as code. Define your enforcement strategy and begin adopting relevant third-party benchmarks and standards. Consider incorporating anomaly and threat detection technologies.&lt;br>
As production environments grow more complex, some issue remediation may require changes to your policy-as-code, Infrastructure as Code, or application code.&lt;br>
You are evaluating
&lt;a href="https://spiffe.io/docs/latest/spire-about/spire-concepts/" target="_blank">SPIFFE/SPIRE&lt;/a> as you move towards a zero trust model for security, and are well underway with certificate and trust store automation, and service mesh integration. Admission controllers now read from your policy platform, enforcing organization-wide or application-specific rules.&lt;br>
You are scanning container images, identifying and addressing
&lt;a href="https://www.cve.org/" target="_blank">CVE&lt;/a>s, and maintaining SBOMs to provide provenance. You aim to meet
&lt;a href="https://slsa.dev/spec/v1.1/levels" target="_blank">SLSA Build Level 1&lt;/a> requirements. Machine learning may also be introduced to enhance threat detection practices.&lt;/li>
&lt;li>Level 4: Apply your security policies to production, if you haven’t already, and continue tuning them. You have implemented measures to reduce attack surfaces, such as preventing manual pod access (e.g., no shell inside pods), while providing safer alternatives with full audit trails (e.g., Falco).&lt;br>
You’re improving security posture by removing the need for insecure workarounds or legacy practices. At this stage, you are working toward
&lt;a href="https://slsa.dev/spec/v1.1/levels#build-l2-hosted-build-platform" target="_blank">SLSA Build Level 2&lt;/a> compliance.&lt;/li>
&lt;li>Level 5: Security policies are continuously optimized in response to evolving threats and business requirements. Exceptions are minimized and formally controlled. You are working toward compliance with
&lt;a href="https://slsa.dev/spec/v1.1/levels#build-l3-hardened-builds" target="_blank">SLSA Build Level 3&lt;/a>.&lt;/li>
&lt;/ul>
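&lt;p>As one minimal sketch of the Level 3 practice described above, where admission controllers enforce organization-wide rules, the fragment below uses the Kubernetes ValidatingAdmissionPolicy API to require a label on Deployments. The policy name, label key, and match scope are hypothetical illustrations, not a recommended ruleset.&lt;/p>

```yaml
# Hypothetical example: require an "owner" label on Deployments.
# Policy name, label key, and match scope are illustrative only.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-owner-label          # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "has(object.metadata.labels) && 'owner' in object.metadata.labels"
    message: "Deployments must carry an 'owner' label."
---
# A binding is required for the policy to take effect.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-owner-label-binding  # hypothetical name
spec:
  policyName: require-owner-label
  validationActions: ["Deny"]
```

&lt;p>In a GitOps workflow, a policy like this would itself live in source control and be reconciled to the cluster, so rule changes follow the same review and audit path as application code.&lt;/p>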
&lt;h1 id="cost-efficiency-resource-usage-and-sustainability">Cost Efficiency, Resource Usage and Sustainability&lt;/h1>
&lt;p>This section outlines how efficiency and sustainability practices evolve from basic resource tuning and cost awareness to full optimization of workloads across architecture, geography, and carbon impact. As organizations mature, they incorporate FinOps, sustainability reporting, and developer accountability to achieve peak efficiency and minimize wasted resources.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: A common temptation at this level is to focus solely on containerizing and deploying workloads, without tuning resource requests and limits. Addressing these early yields long-term benefits. Developers will be involved, particularly in sizing memory allocations (e.g., Java heap sizes).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: The primary focus is reaching production. This often results in cluster sprawl across environments, leading to increased costs and operational complexity. You begin limiting resource consumption and may explore multi-tenancy (e.g., Namespace as a Service) or consolidating environments (e.g., dev and test in a single cluster with separate namespaces).&lt;/p>
&lt;p>You might evaluate CNCF projects like
&lt;a href="https://projectcapsule.dev/" target="_blank">Capsule&lt;/a> to reduce the overhead of running multi-tenant clusters. Chargeback and FinOps capabilities are introduced in a basic form, such as tracking CPU and pod requests or applying quotas at the namespace level. These practices will become more granular as you mature. You also begin experimenting with vertical and horizontal pod autoscaling—initially based on platform metrics, with early exploration into application-level metrics to inform scaling decisions using KEDA.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: Sticker shock may occur here (or even earlier). You begin consolidating workloads and resources more aggressively, potentially using spot VMs and deploying in lower-cost regions, while balancing availability and performance. Idle development and test workloads are scaled down outside business hours where possible.&lt;/p>
&lt;p>You become more aware of underlying infrastructure—such as machine types, chipsets like ARM or RISC-V, and specialized hardware like GPUs. FinOps becomes essential as multiple teams run production workloads. Cluster efficiency improves through approaches like Namespace as a Service, and you adopt tools to report and manage resource requests effectively.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: You begin reporting on carbon emissions and shift focus from pure cost efficiency to include sustainability goals. This includes carbon reporting and machine right-sizing, considering both size and architecture. Cost-benefit analysis becomes more rigorous—for example, evaluating whether a workload justifies GPU usage.&lt;/p>
&lt;p>Developers now share responsibility for efficiency and are expected to optimize code based on the capabilities of the underlying platform and infrastructure.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: You’ve reached peak efficiency. Resource usage is optimized, FinOps reporting is mature, and your systems are highly sustainable—leveraging efficient chip architectures, immediate responsiveness, and minimal overhead. Wasted resources are minimized, and workloads are right-sized and deployed on the most appropriate platforms in the most efficient regions.&lt;/p>
&lt;/li>
&lt;/ul>
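&lt;p>The Level 1 and Level 2 practices above, tuning container requests and limits and constraining consumption per namespace, can be sketched as follows. The names, namespace, and numbers are hypothetical placeholders for illustration.&lt;/p>

```yaml
# Hypothetical example: a namespace quota plus per-container
# requests/limits. Names, namespace, and sizes are illustrative only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota        # hypothetical name
  namespace: team-a         # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"       # total CPU the namespace may request
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  namespace: team-a
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # hypothetical image
    resources:
      requests:             # what the scheduler reserves
        cpu: 250m
        memory: 256Mi
      limits:               # the hard ceiling at runtime
        cpu: 500m
        memory: 512Mi
```

&lt;p>Note that once a ResourceQuota constrains a resource, pods in that namespace must declare requests or limits for it or be rejected at admission, which is one reason tuning these values early pays off later.&lt;/p>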
&lt;h2 id="ai">AI&lt;/h2>
&lt;p>The
&lt;a href="https://www.cncf.io/reports/cloud-native-artificial-intelligence-whitepaper/" target="_blank">CNCF AI White Paper&lt;/a> describes “Cloud Native Artificial Intelligence [as] an evolving extension of Cloud Native” that “&amp;hellip;refers to approaches and patterns for building and deploying AI applications and workloads using the principles of Cloud Native. Enabling repeatable and scalable AI-focused workflows allows AI practitioners to focus on their domain”. In this context, cloud native works to solve with its scalability, resilience, observability and manageability many of the challenges that AI suffers.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Level 1: Starting out with initial development and experimentation, organizations explore basic AI concepts and conduct small-scale experiments, typically where the outcome is known, using discriminative AI such as the classification of email. Developers working with AI will be working mostly on rapid prototyping and gaining access to resources such as storage, networking and processing for training (the process of building an AI model from data) and inference (computing results from AI models). Kubernetes facilitates resource access, and model dependencies can be managed effectively through containerization. Packaged as OCI artifacts, models can be stored in registries and cached.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 2: With the first production deployment of AI models, the emphasis shifts to ensuring the AI workload’s stability, basic scalability and service resiliency. Initial observability is important, and model drift needs to be tracked. Tools such as OpenTelemetry and Prometheus can also assist with monitoring load, access counts and response latency. Security is also a concern in production, with model serving instances requiring firewall protection, access control, penetration testing, and compliance checks. Reaching production is a major step, but it’s not the last in increasing maturity.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 3: As the organization begins to scale, more complex AI applications like Generative AI and Large Language Models (LLMs) are introduced, and the focus shifts to standardizing MLOps processes and addressing inconsistencies between development and production environments. If not already employed, you may find yourself investigating solutions like the Python SDKs for Kubeflow, or general-purpose distributed computing engines like Ray (and KubeRay on Kubernetes). For more advanced Kubernetes scheduling needs, Kueue, Volcano and other non-CNCF tools may help. Data volumes and locality will also drive technical solution architecture. Rising processing demands from LLMs will lead to increased demand for accelerators such as GPUs, TPUs and other technologies. Given the cost implications, technologies that allow resources to be shared, such as vGPU and Multi-Instance GPU, may be investigated. Kubernetes development around Dynamic Resource Allocation may also help. To promote sustainability as well as reduce costs, models may be autoscaled for serving and placed in geographical regions powered by cleaner energy.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 4: At this stage the organization has achieved significant control over its AI operations, with mature MLOps practices, and has formalized its cloud native design patterns for AI workloads. There’s a clear understanding of the AI supply chain, and policies actively govern its security and runtime state. At this level we see the beginning of a symbiotic relationship in which AI is used to improve cloud native systems themselves. Projects like K8sGPT can put LLM-based log analysis into the hands of operators, enhancing their productivity and allowing less technical users to operate complex systems. AI can also be used to identify workload patterns, anticipate load, and optimize resource scheduling for power conservation, resource utilization, latency and priorities. Best practices for the software supply chain should also be followed here, including taking advantage of capabilities such as hardware-supported Trusted Execution Environments to protect sensitive data and valuable ML models. Security may also be enhanced by using AI itself as a ‘red team’ member to identify security gaps.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Level 5: In line with the overall drive towards efficiency, the AI lifecycle is fully automated and highly optimized through cloud native technologies, and is deeply integrated into the operational fabric of the cloud native environment, with the organization achieving peak efficiency and sustainability for its AI workloads. Cloud native supports this through full infrastructure automation and continuous deployment for AI. AI itself is used for operational tasks, and resource usage is optimized with full FinOps capabilities. Promoting environmental sustainability and developing energy-efficient AI models is crucial at this level.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>Aspects:</title><link>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/prologue/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-85--cncfmaturitymodel.netlify.app/aspects/prologue/</guid><description>
&lt;h1 id="cloud-native-maturity-model---prologue">Cloud Native Maturity Model - Prologue&lt;/h1>
&lt;h2 id="navigation">Navigation&lt;/h2>
&lt;p>The Cloud Native Maturity Model is composed of six separate documents - this document, the
&lt;a href="./prologue.md" target="_blank">Prologue&lt;/a>, and the five key reference documents:&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="./people.md" target="_blank">People&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./process.md" target="_blank">Process&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./policy.md" target="_blank">Policy&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./technology.md" target="_blank">Technology&lt;/a>&lt;/li>
&lt;li>
&lt;a href="./business_outcomes.md" target="_blank">Business Outcomes&lt;/a>&lt;/li>
&lt;/ul>
&lt;h1 id="the-cloud-native-maturity-model-a-framework-for-your-success">The Cloud Native Maturity Model: A Framework for Your Success&lt;/h1>
&lt;p>The Cloud Native Maturity Model is designed to support &lt;em>you&lt;/em>—whether you&amp;rsquo;re just beginning your cloud native journey, leading a team of practitioners, or are already an experienced expert. This model helps identify where you may need to invest in tools, processes, people, or policies. Most importantly, it bridges the gap between technical goals and business outcomes, enabling you to effectively communicate the value of your cloud native strategy to organizational leadership.&lt;/p>
&lt;p>Developed by practitioners who have guided numerous organizations through cloud native transformations, the model addresses a common challenge: starting without a clear roadmap. Its purpose is to provide a structured, practical framework to guide your journey—from initial adoption to full maturity.&lt;/p>
&lt;p>By aligning with the CNCF landscape, the model helps you unlock the full potential of cloud native technologies to build and operate scalable, resilient applications across public and hybrid cloud environments.&lt;/p>
&lt;h1 id="cloud-native-maturity-model-40-beta-now-available">&lt;strong>Cloud Native Maturity Model 4.0 (Beta) Now Available&lt;/strong>&lt;/h1>
&lt;p>The beta release of Cloud Native Maturity Model (CNMM) 4.0 is here. The Cartografos Working Group has updated the model to reflect the rapid evolution of the cloud native ecosystem since its initial launch in 2021 and subsequent updates in 2022 and 2023.&lt;/p>
&lt;p>Version 4.0 introduces expanded coverage of AI, emerging technologies, and the organizational and cultural shifts required to succeed in today’s complex IT landscape. Critically, it continues to emphasize alignment between cloud native practices and business outcomes, ensuring organizations don’t just adopt technology—but do so with purpose and impact.&lt;/p>
&lt;p>&lt;strong>Cloud Native Must Serve the Business&lt;/strong>&lt;/p>
&lt;p>At its core, cloud native maturity must be driven by business requirements. No organization should adopt cloud native technologies without a clear connection to business goals. That’s why version 4.0 puts greater emphasis on what the business must consider to fully realize the benefits of cloud native.&lt;/p>
&lt;p>Too often, technical conversations dominate cloud native initiatives—causing a disconnect between engineering and “the business.” But the business doesn’t care about Kubernetes for Kubernetes’ sake. It cares about managing risk, meeting compliance obligations, ensuring customer satisfaction and trust, and achieving cost efficiency.&lt;/p>
&lt;p>While the CNMM still covers the full range of technology topics, it now begins each maturity level by framing the business and technology focus areas first. Then, it explores how these focus areas affect the four pillars: People, Process, Policy, and Technology. For the purposes of this model, “the business” includes anyone outside infrastructure, platform engineering, operations or development teams.&lt;/p>
&lt;p>If you’re on the technology side, this model will help you translate cloud native concepts into business-relevant language—bridging the communication gap and building stronger alignment.&lt;/p>
&lt;p>&lt;strong>Shifting Left the Business Strategy&lt;/strong>&lt;/p>
&lt;p>It’s worth noting: in the early stages of maturity, technology naturally dominates. However, introducing policy, process, and enforcement earlier accelerates progress. This shift-left approach allows organizations to align technical decisions with business goals from the outset.&lt;/p>
&lt;p>Importantly, cloud native isn’t always a guaranteed cost reduction—especially in the early stages. There’s a financial maturity journey involved. Organizations moving from CAPEX-heavy infrastructure to cloud-based OPEX models must rethink how they view cost. Early on, expenses may rise due to the need to maintain legacy systems while launching cloud native platforms.&lt;/p>
&lt;p>That’s why FinOps practices are essential—particularly for enterprise-scale organizations. FinOps can serve as a strategic ally, helping finance and engineering collaborate to manage and optimize cloud spend as maturity grows.&lt;/p>
&lt;p>The CNMM is designed to work alongside the Platform Maturity Model, offering both a top-down and bottom-up perspective. Together, they provide a layered view that supports strategic planning and tactical execution. This release also includes links to architectural references and resources from CNCF Technical Advisory Groups (TAGs) and other working groups.&lt;/p>
&lt;p>We hope this beta release serves as a cornerstone reference for end users navigating cloud native adoption. As always, we welcome feedback and contributions to continuously evolve the model.&lt;/p>
&lt;h1 id="where-are-you-in-your-cloud-native-maturity">&lt;strong>Where Are You in Your Cloud Native Maturity?&lt;/strong>&lt;/h1>
&lt;p>We surveyed the CNCF community to understand where organizations see themselves on the cloud native maturity journey:&lt;/p>
&lt;ul>
&lt;li>39% say they are in the middle of their journey&lt;/li>
&lt;li>19% are just getting started&lt;/li>
&lt;li>15% believe they’ve reached maturity&lt;/li>
&lt;li>Another 15% fall within Level 2 and 12% within Level 4&lt;/li>
&lt;/ul>
&lt;p>These responses highlight that organizations are at all stages of adoption, reinforcing the need for continued education, resources, and frameworks to guide progress through the many phases of cloud native maturity.&lt;/p>
&lt;p>When asked if they have the resources to make cloud native adoption successful, 54% said they need more support—a clear signal that many teams are still under-resourced.&lt;/p>
&lt;p>Only 15% of respondents say their cloud native efforts are completely aligned with overall IT strategy, while 39% see some alignment, and 8% report almost none. Bridging this gap is critical for success.&lt;/p>
&lt;p>When it comes to business drivers for cloud native initiatives, here’s what respondents identified:&lt;/p>
&lt;ul>
&lt;li>Agility – 85%&lt;/li>
&lt;li>Scalability – 77%&lt;/li>
&lt;li>Cost savings – 65%&lt;/li>
&lt;li>Innovation – 46%&lt;/li>
&lt;li>Customer experience – 27%&lt;/li>
&lt;/ul>
&lt;p>It’s important to note that cost savings—though commonly cited—may not be fully realized until later stages (Level 3 or 4) of the maturity model. Early investments can temporarily increase spend before long-term efficiencies are achieved.&lt;/p>
&lt;h2 id="target-audience">&lt;strong>Target audience&lt;/strong>&lt;/h2>
&lt;p>The target audience for this model is broad and encompasses the following groups:&lt;/p>
&lt;ul>
&lt;li>Businesses that are starting down the path of digital transformation&lt;/li>
&lt;li>Those who want to navigate the massive CNCF landscape and home in on a framework they can implement and trust&lt;/li>
&lt;li>Open source and CNCF projects and practitioners wishing to use or contribute to the model&lt;/li>
&lt;li>Leadership teams looking to understand the benefits of cloud native, scope of effort, and level of investment&lt;/li>
&lt;li>Technologists wishing to get started with moving towards cloud native technologies who are keen to understand in more detail the journey ahead of them, as well as have further areas for investigation highlighted&lt;/li>
&lt;/ul>
&lt;h2 id="how-the-model-is-divided-up">&lt;strong>How the model is divided up&lt;/strong>&lt;/h2>
&lt;p>Cloud native maturity is not just a technology journey, but one influenced by five major areas:&lt;/p>
&lt;ol>
&lt;li>Business outcomes - What can the business expect to achieve from cloud native? How are you going to communicate the benefits to the CXO and/or business leadership?&lt;/li>
&lt;li>People - How do we work, what skills do we require, what does our organization look like as we move through this process, and how do we weave security into how people work?&lt;/li>
&lt;li>Process - What processes do we need, what technology is required, how do we map workflows and CI/CD using infrastructure as code (IaC), and how do we shift security as “far left” as possible?&lt;/li>
&lt;li>Policy - What internal and external policies are required to achieve security and compliance mandates? Do these policies reflect your business’s operating environment?&lt;/li>
&lt;li>Technology - What technology is required for you to deliver on the benefits of cloud native and support people, processes and policy, including the technology for CI/CD, adoption of GitOps, observability, security, storage, networking, etc.?&lt;/li>
&lt;/ol>
&lt;h2 id="but-what-if-we-dont-fit-this-model">&lt;strong>But what if we don’t fit this model…&lt;/strong>&lt;/h2>
&lt;p>Relax! No project, organization or person is expected to match all of the details contained within the model perfectly. It’s deliberately designed to cover many different scenarios, everything from startups to Fortune 100 companies. Take what is most relevant to you and your situation; if this helps inspire you in (or indeed account for, but then rule out) any items or areas, then we consider this a success for you!&lt;/p>
&lt;p>&lt;em>The aim of this model is not to be overly prescriptive, but rather to be a tool to help guide you on your journey. Cloud native transformation is not an exact science, but rather lives within your project, your organization, and of course takes place in a specific time and place.&lt;/em>&lt;/p>
&lt;h2 id="prerequisites-for-the-cloud-native-maturity-model">&lt;strong>Prerequisites for the Cloud Native Maturity Model&lt;/strong>&lt;/h2>
&lt;p>The first, and arguably most important, thing to do when adopting cloud native is to outline your business and technology goals, particularly what your business expects to gain from the exercise.&lt;/p>
&lt;p>Few organizations start out with an entirely blank slate (often known as a greenfield). You may have something like the following:&lt;/p>
&lt;ul>
&lt;li>Your organization may range in age from a few months to years, decades or even longer, and may carry a collection of technical debt.&lt;/li>
&lt;li>You may have a considerable application, platform and infrastructure estate.&lt;/li>
&lt;li>You may even have started a migration to cloud service providers, perhaps adopting a ‘lift and shift’ approach with your existing estate.&lt;/li>
&lt;/ul>
&lt;p>The most important thing you should have is a clear idea of the business outcomes you expect to achieve. These will be your ‘north star’, helping guide your decision making process.&lt;/p>
&lt;h2 id="when-is-the-right-time">&lt;strong>When is the right time&lt;/strong>&lt;/h2>
&lt;p>You may be ready to start your cloud native journey if you meet the following criteria:&lt;/p>
&lt;h3 id="business-outcomes">&lt;strong>Business Outcomes&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Whether you are a startup choosing to build on the cloud or an enterprise organization adopting cloud native, there must be long-term business goals that justify the investment in cloud native. These goals may be derived internally from corporate strategy, or externally from industry trends or competitive market pressures.&lt;/li>
&lt;li>Prioritized business goals must drive the decision making with all stakeholders aligned.&lt;/li>
&lt;li>Organizations should have established meaningful processes for sharing information and results between business units.&lt;/li>
&lt;/ul>
&lt;h3 id="people">&lt;strong>People&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>You have significant separation between development and operations, with a clear staff delineation between infrastructure, cloud, application operations and development.&lt;/li>
&lt;li>You have traditionally split your operations and infrastructure divisions and your application development departments. This may have been enforced by regulatory requirements.&lt;/li>
&lt;li>This split may have worked well for you, and indeed may be mandated. But you may be finding additional challenges as your platforms increasingly become code- and application-oriented. You may find you require skills in your platform area that have traditionally belonged within your application area.&lt;/li>
&lt;/ul>
&lt;h3 id="process">&lt;strong>Process&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Your application deployments may be done manually in many cases, or your release processes may take a very long time to complete, often with multiple attempts.&lt;/li>
&lt;li>You may support multiple distributions of the same software and have trouble upgrading or evaluating without significant downtime.&lt;/li>
&lt;/ul>
&lt;h3 id="policy">&lt;strong>Policy&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Policy may be in the form of conventions and rules that are located external to the application and its platform, and are not enforced natively within your applications and runtime environment.&lt;/li>
&lt;li>Policies might be disparate and built in silos; defense-in-depth coverage might be more accidental than deliberate.&lt;/li>
&lt;/ul>
&lt;h3 id="technology">&lt;strong>Technology&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>You’ll likely have VMs on demand.&lt;/li>
&lt;li>You may have some automation scattered around.&lt;/li>
&lt;li>You will have baseline security components such as SIEM, RBAC concepts, and a directory of some type.&lt;/li>
&lt;li>You have some software packaging, but this could be inconsistent.&lt;/li>
&lt;li>You’ll have perimeter security and perhaps some coarse network zoning at layers 1-4, but you may feel some anxiety about your security practices.&lt;/li>
&lt;li>Your experience with encryption may vary - you might have some certificate authorities for example, but they may not be used extensively, with a high barrier to entry for many.&lt;/li>
&lt;li>Your applications may rely on infrastructure solutions for high availability, which in turn may be more costly than you’d like.&lt;/li>
&lt;li>Your server estate could range from single physical or virtual servers with low levels of availability, through to highly available clusters. Scaling could be a real challenge and may require considerable investment in money, time and planning.&lt;/li>
&lt;li>You may have started to dip your toe into an ‘Everything as Code’ model, e.g., scripting your infrastructure with Terraform.&lt;/li>
&lt;/ul>
&lt;h2 id="the-cloud-native-maturity-model-journey">&lt;strong>The Cloud Native Maturity Model Journey&lt;/strong>&lt;/h2>
&lt;p>There are five levels within the cloud native maturity model. While you may be at Level 5 for one application, you may at the same time be at Level 2 for another. Keep that in mind as you identify your level of maturity.&lt;/p>
&lt;p>Through level 4, you&amp;rsquo;ve likely stood on the shoulders of giants. At level 5, you become the giant—there&amp;rsquo;s no one left to stand on but you.&lt;/p>
&lt;ul>
&lt;li>
&lt;a href="https://maturitymodel.cncf.io/level-1/" target="_blank">&lt;strong>Level 1 - Build&lt;/strong>&lt;/a> You have a baseline cloud native implementation in place and are in pre-production. Of importance, level one isn’t a lab or POC, you do have an implementation in place. It can be really hard to move from a build to operate stage.&lt;/li>
&lt;li>
&lt;a href="https://maturitymodel.cncf.io/level-2/" target="_blank">&lt;strong>Level 2 - Operate&lt;/strong>&lt;/a> The cloud native foundation is established and you are moving to production.&lt;/li>
&lt;li>
&lt;a href="https://maturitymodel.cncf.io/level-3/" target="_blank">&lt;strong>Level 3 - Scale&lt;/strong>&lt;/a> Your competency is growing and you are defining processes for scale.&lt;/li>
&lt;li>
&lt;a href="https://maturitymodel.cncf.io/level-4/" target="_blank">&lt;strong>Level 4 - Improve&lt;/strong>&lt;/a> You are improving security, policy and governance across your environment.&lt;/li>
&lt;li>
&lt;a href="https://maturitymodel.cncf.io/level-5/" target="_blank">&lt;strong>Level 5 - Adapt&lt;/strong>&lt;/a> You are revisiting decisions made earlier and monitoring applications and infrastructure for optimization.&lt;/li>
&lt;/ul>
&lt;p>In each of the following sections, we will highlight core concepts and discuss what this means in each stage of your maturity across people, process, policy and technology.&lt;/p>
&lt;p>We welcome feedback from the community on the Cloud Native Maturity Model!&lt;/p>
&lt;h2 id="position-on-included-technologies">&lt;strong>Position on Included Technologies&lt;/strong>&lt;/h2>
&lt;p>The Cloud Native Maturity Model references only CNCF graduated or incubating projects. The Maturity Model’s default position on CNCF sandbox projects is to exclude them unless they are referenced in the later stages of maturity (i.e. by users that have achieved Level 4 or 5). It does not and will not include any reference to commercial software.&lt;/p></description></item></channel></rss>