August 28, 2023


Test Environment Management

Read about the key considerations for successful test environment management, including success criteria, stakeholders, access control, and much more.

Test environment management involves the timely creation and deletion of the underlying infrastructure required to run tests that mimic production behavior. With an automated management system, teams can receive feedback quickly, iterate, and ultimately ship new features more efficiently.

Automated environment management provides safe, isolated environments for developers, product managers, team leads, and other stakeholders to ensure that business and product requirements are being met earlier in the development process where it’s easier to catch and fix issues.

Test environments have two prevailing models. The most common is shared persistent environments, while the preferred or best practice is isolated ephemeral environments.

Shared environments have multiple tests from different teams running on the same deployments and sharing the same resources. In those environments, teams share a couple of predefined, pre-provisioned environments to which developers push their changes for testing. This setup requires a lot of coordination and is prone to errors and rework because bugs are often introduced that impact the entire testing process across the whole team.

Ephemeral environments provide isolated replicas of the same product features, running separately so that each team can run its own tests, and can help alleviate many of the challenges that shared environments create. However, the ephemeral approach also has its own implementation challenges, which can be addressed by an environment-as-a-service (EaaS) provider such as Uffizzi, which provides on-demand virtual clusters in a secure, multi-tenant fashion. When managed internally, these environments require proper setup and a great deal of infrastructure configuration and provisioning to ensure that they meet the required efficiency and security standards explained in this article.

This article covers the key concepts, success criteria, challenges, and best practices of managing test environments, especially for containerized applications hosted on a Kubernetes platform.

Summary of key test environment management concepts

The table below describes the essential factors that make test environment management successful.

Topic	Best Practices
Success criteria	Measure success metrics such as platform uptime, speed of instantiating and deleting environments, number of concurrent environments, and graceful shutdown rate.
Infrastructure	Use containerization and orchestration tools that allow for flexible deployments. Provide developers with automation to deploy changes with minimal intervention. Provide developers with isolated monitoring of the systems being tested.
Data	Allow for an easy-to-use interface to add testing data. Specify a clear retention policy for data in your test environment. Implement anonymization and data obfuscation techniques to avoid sensitive data leakage. Implement access policies for data in your test environment.
Access Control	Manage access via identity providers and role-based access control. Follow the need-to-know principle when providing access to individuals. Apply security hardening of the network layer through network policies. Provide tools for safe secret management across the organization.
Advanced testing	Provide an interface to add external dependencies such as APIs or data stores. Allow external infrastructure components to be integrated.
Observability	Provide a crash reporting tool that forwards errors to the teams’ devices of choice. Provide aggregated logs in a centralized dashboard, so teams can easily debug their services. Enable metrics and basic dashboards that allow teams to understand the performance of their services in terms of latencies, availability, and behavior under specific traffic loads.
Education	Simplify tooling and allow for less user-specific configuration. Educate teams on best practices to ensure a seamless testing process. Educate non-technical stakeholders on how to interact with the test environment. Celebrate successes and share metrics with leadership since a well-managed test environment is critical for maintaining the company’s speed of delivery.

What is a test environment?

After developers finish implementing features, they often want to gather feedback and test them in an almost real-life scenario. Teams also want to run cross-functional user acceptance testing (UAT), which lets various stakeholders and end users test the newly implemented feature.

This process can range in complexity. Here are some examples of low, medium, and high complexities:

Low complexity: Reviewing a change in copy or design for a website update.
Medium complexity: Testing a frontend component that calls a backend API to add records to a database.
High complexity: Testing a checkout page that executes payments with different payment methods and communicates with a third-party system.

For the first scenario, we can deploy the client codebase in an isolated environment, review the changes, and make any necessary fixes. The second scenario is more challenging because we need to deploy client code, a backend service, and a database. This leaves us with three different components to deploy. Containerization can be used, as will be discussed in this article’s infrastructure section. For the third scenario, we will likely require everything discussed in the second, medium-complexity scenario in addition to a call to a third-party API to execute payments, with limited control over what the API returns. An environment is needed to provide a mechanism to configure these complex dependencies and provide clear metrics on how these systems behave.

Imagine 20 teams testing scenarios like numbers two and three using the same environment, such as a test, dev, or staging environment. The amount of coordination and rework would make it impossible to predict these systems’ behavior. An ephemeral approach is necessary because without it, teams will face issues with flaky tests, stale code being executed, and undesirable side effects.

Stakeholders in test environment management

Test environments play a crucial role in the software development life cycle. Typically managed by a platform team or an environment-as-a-service (EaaS) solution, these environments relieve the burden of infrastructure management from the developers.

The customers of an internal or outsourced ephemeral test environment system are the development, product, and quality assurance (QA) teams responsible for developing and testing features. They require a mechanism for executing tests that is fast and predictable in order to release product features with confidence that they will deliver business value. The developer platform also helps teams identify and fix bugs early in the development cycle, saving time and resources.

Company leaders also play a role in this process, as they are the stakeholders primarily responsible for allocating testing resources, making decisions on application and testing infrastructure, and evaluating delivery speed.

In the end, all individuals involved in this process have a common goal: speedy delivery with the least possible friction. For many organizations, an existing EaaS solution that meets high standards is a worthwhile investment that enables product teams to focus on delivering their features instead of worrying about the testing infrastructure. This, in turn, leads to faster time to market and more satisfied customers.

Success criteria for test environment management

To effectively serve the stakeholders mentioned above, it is necessary to have metrics that measure the performance of the test environment. These metrics will be a source of truth for the test environment owners to improve the solution and demonstrate its value to leads.

The table below lists some metrics that stakeholders can use to evaluate the success of a test environment.

Metric	What it measures	Description
Platform uptime	Availability	High uptime indicates that the infrastructure is managed well to minimize disruptions.
Environment spin-up time	How quickly new ephemeral environments are created	A low spin-up time indicates that the system is efficiently managed, eliminating unnecessary delays in the testing process.
Environment deletion time	How quickly ephemeral environments can be taken down when they are no longer needed	Faster deletion times indicate that unnecessary resource consumption is minimized.
Concurrent environment capacity	The number of ephemeral environments that can exist at one time	Higher capacity means that your testing infrastructure can support many developers with reduced waiting time for new environments.
Resource utilization / cost savings	How efficiently resources such as CPU and memory are used in each ephemeral environment, and whether time-based or trigger-based deletion is in place to reduce overall costs	Lower resource consumption per environment indicates that waste is minimized, making the most of infrastructure spending.
Graceful shutdown rate	The percentage of environments that the team gracefully shuts down	A high graceful shutdown rate indicates that resources are efficiently managed and that the risk of potential data loss or downtime during the shutdown process is minimal.
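
As a rough illustration, several of the metrics in the table can be derived from a simple log of environment lifecycle events. The sketch below is a minimal example under stated assumptions: the `EnvRecord` fields and the function names are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class EnvRecord:
    """One ephemeral environment's lifecycle (hypothetical audit record)."""
    requested_at: datetime
    ready_at: datetime
    deleted_at: Optional[datetime]  # None while the environment still exists
    graceful: bool = False          # True if the shutdown completed cleanly


def median_spin_up_seconds(records):
    """Median time from request to ready, in seconds."""
    return median((r.ready_at - r.requested_at).total_seconds() for r in records)


def graceful_shutdown_rate(records):
    """Share of deleted environments that shut down gracefully (0.0 to 1.0)."""
    deleted = [r for r in records if r.deleted_at is not None]
    if not deleted:
        return None  # nothing has been deleted yet, so the rate is undefined
    return sum(r.graceful for r in deleted) / len(deleted)


def peak_concurrency(records):
    """Largest number of environments that existed at the same moment."""
    events = []
    for r in records:
        events.append((r.ready_at, 1))
        if r.deleted_at is not None:
            events.append((r.deleted_at, -1))
    # At equal timestamps, process deletions (-1) before creations (+1) so
    # back-to-back environments are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

Tracking the median rather than the mean keeps a single slow provisioning run from dominating the spin-up figure; a team could just as reasonably report a high percentile instead.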

Test environment infrastructure

An ephemeral environment is an application stack deployment that can be easily created and deleted after use. Containers are the ideal platform for hosting the application stack components since they are designed to be spun up and torn down in seconds. One of the most common setups for such environments is Docker containers running on Kubernetes.

Using containers on an orchestration platform offers several advantages for ephemeral test environments, including the following:

Resource management: Orchestration platforms manage the cluster’s CPU, memory, and storage resources. They ensure that containers have the resources they need to run effectively, and they can limit resource usage to prevent overconsumption.
Scheduling: Orchestration tools like Kubernetes decide which node each container should run on based on resource availability, workload requirements, and user-defined constraints. This helps balance workloads and optimize resource utilization across the cluster.
Scaling: Orchestration platforms enable horizontal and vertical scaling of containerized applications to handle fluctuations in workload demands. They can automatically add or remove containers or adjust resource allocations based on predefined rules and metrics.
Service discovery: Orchestration platforms facilitate container communication by assigning unique network identities (IP addresses or DNS names) and managing inter-container networking.
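
To make the resource management point concrete, the sketch below builds a Kubernetes `Namespace` and a `ResourceQuota` for one ephemeral environment as plain Python dictionaries. The function names, the label, and the quota values are illustrative assumptions; only the manifest fields themselves are standard Kubernetes.

```python
import json


def namespace_with_quota(env_name, cpu="2", memory="4Gi", max_pods=20):
    """Return manifests (as dicts) for a namespace with capped resources."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        # The label is a hypothetical convention for finding test namespaces.
        "metadata": {"name": env_name, "labels": {"purpose": "ephemeral-test"}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "env-quota", "namespace": env_name},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                "pods": str(max_pods),
            }
        },
    }
    return [namespace, quota]


def to_kubectl_json(manifests):
    """Wrap manifests in a v1 List so they can be applied as one JSON document."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2
    )
```

The JSON output could be piped to `kubectl apply -f -`. With a quota in place, one team's runaway test cannot starve the other environments sharing the cluster.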

Once the underlying infrastructure is configured with a testing orchestrator (or a testing cluster), teams can begin running tests. The table below provides a high-level overview of how developers could use ephemeral environments to perform UAT on a newly developed feature.

Step	Description
Build code	Compile code and build different containers for the components the orchestration system will deploy.
Publish containers	Push containers to a registry that the orchestration cluster can access over the network.
Deploy the test environment	Mimic the same flow of deploying to production but in a test environment instead. Pull containers for every component used (e.g., a backend service and a database) and start them within the environment.
Share links with stakeholders for testing	Stakeholders can access the deployed version of the product and perform their testing.
Apply changes (if needed)	Repeat the first four steps after adding a fix to the codebase, and share the new version for further testing.
Clean up the environment	Delete the environment after testing is done.

This diagram shows the state of the testing cluster when multiple teams are running their workloads in isolated containers in parallel.

An example snapshot of an ephemeral testing environment with three parallel environments
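
Two small pieces of the workflow above lend themselves to automation: giving each environment a predictable name, and finding environments that are due for cleanup. The following is a minimal sketch, assuming one environment per pull request and an eight-hour time-to-live; both conventions, and the function names, are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone


def env_name_for(repo, pr_number):
    """Build an environment name such as 'shop-api-pr-42' from a repo and PR."""
    # Keep only lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9-]+", "-", repo.lower()).strip("-")
    suffix = f"-pr-{pr_number}"
    # Kubernetes namespace names are limited to 63 characters.
    return slug[: 63 - len(suffix)].rstrip("-") + suffix


def expired(envs, now, ttl=timedelta(hours=8)):
    """Return the names of environments not updated within the TTL.

    `envs` maps an environment name to the time of its last deployment.
    """
    return sorted(name for name, updated in envs.items() if now - updated > ttl)
```

A scheduled job could call `expired` and delete whatever it returns, which covers the time-based cleanup case; trigger-based cleanup (for example, on pull request close) would call the same deletion path directly.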

Test environment data, retention, and privacy

Managing data in an internally developed ephemeral testing environment on Kubernetes presents unique challenges, many of which EaaS providers address. Here are some of the critical issues and considerations.

Challenge	Best practice	Explanation
Data consistency	Implement data versioning and control mechanisms to maintain consistency across different stages and environments.	Ensuring consistency across multiple instances of persistent test data can be complex and resource-intensive because multiple developers and testers might work with the data simultaneously. Using ephemeral storage, which gets deleted and recreated every time the environment is provisioned, is easier to manage.
Privacy concerns	Anonymize or obfuscate sensitive test data to protect privacy and ensure compliance with data protection regulations.	Test data containing sensitive or personally identifiable information must comply with data protection regulations. Anonymization and obfuscation may not be foolproof, so additional measures (e.g., access controls) might be necessary.
Data security	Implement role-based access control as explained below.	Ensuring that data is accessed only by the code and individuals who need it can be difficult. Depending on the resources available, teams may choose to forgo security hardening for their testing environments and reserve those measures for their production environments.
Seed sample data	Use data generation tools (e.g., Faker, NBuilder, or TestDataGenerator) to create seed sample data.	Generating realistic and diverse seed sample data for testing environments requires considerable effort. If a data generation tool can provide sufficiently realistic, customizable, and complex data, it will save developer time.
Regulations and compliance	Utilize a combination of suitable access policies, data retention policies, and proper observability of interaction with data to maintain a secure and compliant environment.	Ensuring that data management practices are compliant with data protection regulations requires ongoing effort and resource allocation and is subject to changing regulations and requirements.
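
The obfuscation and seed-data rows can be illustrated with the Python standard library alone. This is a sketch rather than a compliance-grade solution: it uses keyed hashing (HMAC-SHA-256) for pseudonymization and a seeded random generator in place of a dedicated data generation tool. The function names, the `example.test` domain, and the record layout are assumptions.

```python
import hashlib
import hmac
import random


def pseudonymize_email(email, key):
    """Replace an email address with a stable stand-in.

    The same address and key always map to the same output, so joins across
    tables keep working. Without the key, a guessed address cannot be
    confirmed by hashing it and comparing.
    """
    normalized = email.strip().lower().encode()
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return f"user-{digest[:12]}@example.test"


def seed_users(count, seed=0):
    """Generate deterministic synthetic users; no production data is involved."""
    rng = random.Random(seed)  # a fixed seed makes test runs reproducible
    plans = ["free", "pro", "enterprise"]
    return [
        {"id": i, "email": f"user{i}@example.test", "plan": rng.choice(plans)}
        for i in range(1, count + 1)
    ]
```

Pseudonymized data can still count as personal data under some regulations, so this technique complements access controls and retention policies rather than replacing them.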

Test environment access control

Compliance requires authorization, which involves determining who can access what to ensure a safe and compliant testing environment with appropriate access policies. When planning a test environment for containerized applications on a Kubernetes platform, take the following considerations into account:

Role-based access control (RBAC): Use RBAC to define roles with specific permissions and assign them to users or groups, granting appropriate access to stakeholders based on their responsibilities and job functions.
Network policies: Define policies to control traffic flow between containers and external networks. This restricts communication among components, minimizing the potential attack surface and limiting access to sensitive resources. Isolation is ensured at the data, code, and network levels.
Secrets management: Securely store secrets, such as API keys, passwords, and tokens, using Kubernetes Secrets or third-party secrets management solutions. Grant access to secrets on a need-to-know basis and limit exposure to reduce the risk of unauthorized access or leaks.
Ingress and egress policies: Manage inbound and outbound traffic to your cluster by defining ingress and egress rules. Limit access to specific IP ranges, ports, or protocols to maintain security and control over your test environment. Some teams might need access to third-party tools that require a network connection to external networks. A caveat here is what the networking layer of your infrastructure allows: Services providing dynamic IP addresses won’t follow the same security patterns, while other tools, like service meshes, do whitelisting and blacklisting on API calls and endpoints instead.
Single sign-on (SSO): Use SSO as an entry point to testing environments, allowing stakeholders to obtain the tokens needed to identify themselves and for the environment to understand the right level of access based on RBAC.
Virtual private networks: Some organizations use VPNs to protect their environments and ensure that systems are only accessible within their networks. However, a VPN is not sufficient by itself; it cannot be the only network security mechanism since fundamental access rights and permissions still need to be considered.
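
As one possible starting point for the network policy item, the sketch below builds a Kubernetes `NetworkPolicy` that confines pods to traffic within their own namespace while still allowing DNS lookups. The function name and policy name are hypothetical, and the policy only takes effect if the cluster's network plugin enforces network policies.

```python
def same_namespace_only_policy(namespace):
    """NetworkPolicy dict: pods may talk only to pods in their own namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "same-namespace-only", "namespace": namespace},
        "spec": {
            "podSelector": {},  # an empty selector matches every pod here
            "policyTypes": ["Ingress", "Egress"],
            # A peer with only a podSelector matches pods in this namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
            "egress": [
                {"to": [{"podSelector": {}}]},
                # Allow port 53 to any destination so DNS keeps resolving.
                {
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ]
                },
            ],
        },
    }
```

Because this policy blocks all other inbound traffic, an environment that stakeholders reach through an ingress controller would need an additional ingress rule admitting that controller, and any third-party API calls would need matching egress rules.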

Test environment managers should follow the “need-to-know” principle, granting individuals only the level of access they need to perform their expected tasks. Here are some simple policy definitions organized by stakeholder.

Stakeholder	Reason for access
Developers	Developers need simple edit rights in the environments created for them so that they can tune aspects such as sample data.
Product managers, testers, and other stakeholders	They require view-only access to the environments in which they are testing.
Platform teams	Platform teams need admin access to all environments and access to the entire infrastructure to investigate issues, ensure security, and deploy fixes when necessary.
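
The policy table above could translate into namespaced Kubernetes RBAC objects roughly as follows. This is a sketch under stated assumptions: the group names `developers` and `reviewers`, the role naming scheme, and the chosen resource list are hypothetical and would normally come from the identity provider and the team's own conventions.

```python
READ_VERBS = ["get", "list", "watch"]
EDIT_VERBS = READ_VERBS + ["create", "update", "patch", "delete"]

# Hypothetical mapping from stakeholder group to allowed verbs.
VERBS_BY_GROUP = {
    "developers": EDIT_VERBS,
    "reviewers": READ_VERBS,  # product managers, testers, other stakeholders
}


def role_and_binding(namespace, group):
    """Return a Role and RoleBinding (as dicts) scoped to one environment."""
    role_name = f"{group}-access"
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                # "" is the core API group; "apps" covers deployments.
                "apiGroups": ["", "apps"],
                # Secrets are deliberately left out (need-to-know).
                "resources": [
                    "pods", "pods/log", "services", "configmaps", "deployments",
                ],
                "verbs": VERBS_BY_GROUP[group],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "Role",
            "name": role_name,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    return role, binding
```

Platform teams are omitted from the mapping because their access spans all environments; cluster-wide access of that kind is typically granted with a `ClusterRoleBinding` rather than a per-namespace `Role`.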

Note that while they are not classically considered “stakeholders,” the software systems involved in this process are also important as they do the deployment and teardown within the testing environment.

Advanced testing

Testing can become more complex as both the diversity and number of dependencies related to software components increase, so teams hosting containerized workloads on a Kubernetes platform may require dedicated test environments to meet their needs.

For example, individuals working on data warehousing, real-time data processing, or third-party app marketplaces might need help expanding their testing infrastructures beyond what was discussed above.

In these scenarios, organizations have historically chosen between utilizing Kubernetes namespaces (a cheap but weakly isolated option) and deploying entirely separate Kubernetes clusters dedicated to each test. While deploying separate clusters—which may include message queues, caching systems, real-time processing systems, or distributed databases—provides strong isolation and better simulates the production environment than namespaces, this approach is costly and requires substantial effort.

As an alternative, some EaaS solutions like Uffizzi have developed the concept of virtual clusters, which are isolated clusters with their own API servers that run on top of an existing Kubernetes cluster. The primary advantage of virtual clusters is that they provide isolation and provisioning capabilities similar to those of dedicated clusters while being less expensive and faster to provision, and they are more secure than namespace-based isolation. Virtual clusters are designed to be lightweight, so they can be easily created and torn down, consistent with the ephemeral environment testing model. You can learn more about virtual clustering concepts by reading this article about Kubernetes multi-tenancy.

Log and crash reporting for developers

One challenge of using an internally managed ephemeral test environment is its opaqueness. Unless they are cluster admins, developers deploy services to an environment that they cannot access directly, which makes debugging and viewing logs or metrics difficult. Therefore, observability plays a large role in ensuring the success of a test environment. While teams using internally managed test environments can improve observability with a suite of tools, many EaaS solutions provide it through intuitive dashboards that allow team members to view logs and metrics and set role-based access controls in one place.

Whether using an internally managed or EaaS solution, there are several considerations related to observability to keep in mind:

Logging: The environment should collect, store, and analyze logs from applications, containers, and the underlying infrastructure to help developers identify errors, track application behavior, and troubleshoot issues.
Metrics: The environment should gather and visualize key performance metrics, such as CPU usage, memory consumption, network throughput, and error rates, to monitor the health and performance of your systems.
Distributed tracing: Distributed tracing should be implemented to track requests as they flow through microservices and other components in your system. This can help identify bottlenecks, latency issues, and other performance problems.
Performance profiling: Application performance should be profiled to identify areas for potential optimization and performance bottlenecks at the code level.
Synthetic monitoring: The environment should allow automated tests to be run to simulate user interactions and measure the performance and availability of your services from an end-user perspective.
Health checks and probes: Health checks and probes should be implemented to monitor the availability and readiness of your applications and services, and automated recovery should be enabled in case of failures.
Alerting: Automated alerting should be set up based on performance thresholds, error rates, or other criteria to notify stakeholders when issues arise. This allows for quick identification and resolution.
Crash reporting: Crash reporting should exist to automatically capture and analyze application crashes, exceptions, and errors. This provides developers with detailed information to diagnose and resolve issues.
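
Two of the considerations above, logging and alerting, can be sketched with the Python standard library. The example assumes that each log line is tagged with the name of the environment that produced it so that a central dashboard can filter by environment; the class name, field names, and thresholds are illustrative assumptions.

```python
import json
import logging


class EnvJsonFormatter(logging.Formatter):
    """Render each log record as one JSON line tagged with its environment."""

    def __init__(self, env_name):
        super().__init__()
        self.env_name = env_name

    def format(self, record):
        return json.dumps(
            {
                "env": self.env_name,
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def should_alert(error_count, request_count, threshold=0.05, min_requests=20):
    """Return True when the error rate exceeds the threshold on enough traffic."""
    if request_count < min_requests:
        return False  # too few requests to judge; avoids noisy alerts
    return error_count / request_count > threshold
```

A service would attach the formatter to a `logging.StreamHandler` at startup, and a log collector could then group lines by the `env` field. The minimum-request guard matters in short-lived environments, where a single failed request would otherwise look like a very high error rate.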

Conclusion

Managing ephemeral test environments is a complex task that requires specialized skills and knowledge. Common challenges include providing a good developer experience, guaranteeing isolation and availability, maintaining IT infrastructure for reliability and scalability, ensuring security through access rights and data policies, and providing observability so that teams can identify and fix issues before they become significant problems. EaaS solutions like Uffizzi can significantly reduce the resources required to build an effective and robust testing platform internally. This allows teams to focus on delivering business value and reducing time to market while ensuring that the test environment is reliable, secure, and fully functional.