The Perfect Production Environment

The production environment is where the latest versions of software, products or updates are pushed live to their intended users, one of the final stages of delivery. It is where end users can see, experience and try out the new product.

At a recent client event, Dzmitry Viaroukin, Principal Software Engineer, and Jorge Garcia de Bustos, Technical Presales Consultant at Godel, ran a webinar on what the ideal production environment looks like in terms of deployment, logging, monitoring and tracing, and explored the benefits it can bring to your business.

The importance of a great production environment

A great production environment ensures a faster turnaround time from ideas to working code in production. New features bring in revenue sooner and let you test assumptions and theories more quickly. It also enables a tech team to build a strong engineering culture of root cause understanding, which makes delivery more predictable in terms of timelines, roadmap and complexity.

Value for your product and company

The production environment offers a quick product feedback loop and supports a fail-fast delivery culture with experiments and A/B testing. Testing gives a deeper understanding of the product and how it works for your users, and allows you to scale what is working and change what isn’t. Overall, product development teams gain a deeper insight into system and user behaviour, allowing them to make better predictions and assumptions for the future.

The perfect environment brings value to your company:

  • Technical excellence enhances agility: In the words of the Agile principles, “continuous attention to technical excellence and good design enhances agility”.
  • Excellence brings excellence: A DevOps culture makes the tech stack more attractive and helps retain and develop the kind of engineers who value that.
  • Mistakes happen: When working on innovative tech and products, the focus is on learning from mistakes, overcoming them and looking at what can be improved.
  • Anyone can be a target for malicious actors: Detecting unusual activity requires knowing what usual patterns look like through logging, monitoring, alerts, etc.

Deployment as delivery metrics

  • Deployment frequency: How often an organisation successfully releases to production. It is one of the main data-driven metrics of the Agile methodology.
  • Lead time for changes: How long it takes for a committed change to reach production.
  • Change failure rate: The percentage of deployments causing a failure in production. Planning for disasters that will impact your ability to run a production environment ensures you have appropriate solutions in place.
  • Time to restore service: How long it takes an organisation to recover from a failure in production.
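As a rough illustration, the four metrics above could be computed from a list of deployment records along the following lines. This is a minimal sketch, not a standard implementation: the `Deployment` record, its field names and the `delivery_metrics` function are hypothetical, and it assumes you already capture commit, deploy and restore timestamps somewhere.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise the four delivery metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here rather than means so that one unusually slow change does not dominate the figure; that is a design choice for this sketch, and other summaries may suit your team better.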

Deployment strategies

Continuous deployment calls for monitoring of all your apps and all the relevant components of your infrastructure. Setting up actionable alerts with notifications and/or remediation helps maintain quality while deploying continuously. You can then optimise continuously with “Build-Measure-Learn”: track your KPIs or user behaviour metrics and strive to improve them through planning iterations.

Blue/green environment

Blue-green deployment is a deployment strategy that utilises two identical environments, a “blue” and a “green” environment, running different versions of an application or service. One environment serves live traffic while the new version is prepared in the other, and traffic is switched over once the new version is ready. The benefits of blue-green deployment are that it is simple, fast, well understood and easy to implement. However, cost is a drawback, as replicating a production environment can be complicated and pricey.
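To make the idea concrete, here is a minimal sketch of how traffic might be pointed at one of two environments and moved to the other on release. The `BlueGreenRouter` class and the `healthy` callback are hypothetical names used for illustration; in a real setup the switch would usually happen at a load balancer or DNS level rather than in application code.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""

    def __init__(self, blue_version: str, green_version: str, live: str = "blue"):
        self.versions = {"blue": blue_version, "green": green_version}
        self.live = live

    @property
    def idle(self) -> str:
        """The environment that is not currently serving live traffic."""
        return "green" if self.live == "blue" else "blue"

    def release(self, new_version: str, healthy: Callable[[str], bool]) -> bool:
        """Deploy to the idle environment and switch traffic only if it is healthy."""
        target = self.idle
        self.versions[target] = new_version
        if not healthy(target):
            return False  # live traffic never reached the new version
        self.live = target
        return True

    def rollback(self) -> None:
        """Point traffic back at the other environment, which still runs the old version."""
        self.live = self.idle
```

Because the previous environment is left untouched after the switch, rolling back is just another pointer change, which is a large part of why the strategy is considered fast.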

Canary deployment

Canary deployment is the second most well-known strategy. In software engineering, it is the practice of making staged, gradual releases: the idea is to first deploy the change to a small subset of servers, test it, and then roll the change out to the rest of the servers.
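The staged rollout described above can be sketched as follows. This is an illustrative outline only: `canary_rollout` and its `deploy`, `healthy` and `rollback` callbacks are hypothetical stand-ins for whatever your deployment tooling provides, and the 10% default canary size is an arbitrary assumption.

```python
import math
from typing import Callable, Sequence


def canary_rollout(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    canary_fraction: float = 0.1,
) -> bool:
    """Deploy to a small subset first; continue to the rest only if the canaries are healthy."""
    if not servers:
        return True
    # Always use at least one canary, even for very small fleets.
    canary_count = max(1, math.ceil(len(servers) * canary_fraction))
    canaries, rest = servers[:canary_count], servers[canary_count:]

    for server in canaries:
        deploy(server)

    # Test the change on the canaries before touching any other server.
    if not all(healthy(server) for server in canaries):
        for server in canaries:
            rollback(server)
        return False

    for server in rest:
        deploy(server)
    return True
```

In practice the health check would typically watch error rates or latency on the canary servers for a period of time rather than return immediately.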

What to monitor

Production monitoring is the regular inspection of a system while it runs in production. It covers what you measure, how you measure it, how you keep up with KPIs, and how you monitor a list of common things:

  • Cluster level components (VMs, nodes, node pools etc)
  • Managed K8S components (API server, cloud controller, kubelet etc)
  • K8S objects and workloads (deployments, containers, replica sets etc)
  • Application level (your applications – pods, jobs etc)
  • External to K8S (resources that are not part of K8S but required for platform stability, scalability and management)

Alerting

Alerting is an important stage where you create your alert strategy, activity log rules, metric alert rules and log-based alerts. Effective alerting and monitoring require both a strategy and a good solution. It’s important to ensure that alerts are delivered in a timely manner while preventing as many false positives and negatives as possible.
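One common way to cut down on false positives is to alert only when a metric stays above its threshold for several consecutive samples, rather than on a single spike. The sketch below shows that idea under simple assumptions; the `SustainedThresholdAlert` class is hypothetical, and the threshold and window size would need tuning so that real incidents are still caught promptly.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires only when a metric stays above a threshold for several consecutive samples."""

    def __init__(self, threshold: float, consecutive: int):
        self.threshold = threshold
        self.recent = deque(maxlen=consecutive)  # breach flags for the latest samples
        self.firing = False

    def observe(self, value: float) -> bool:
        """Record a sample; return True only at the moment the alert starts firing."""
        self.recent.append(value > self.threshold)
        breached = len(self.recent) == self.recent.maxlen and all(self.recent)
        started = breached and not self.firing
        self.firing = breached
        return started
```

Returning True only on the transition into the firing state avoids sending a new notification for every sample while the problem persists. The trade-off is a short delay before the alert fires, so a larger window reduces noise at the cost of timeliness.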

Logging and Tracing

Almost any logged data can add value when aggregated, charted or further analysed. Overall, log anything interesting to the business. The more data you capture, the more visibility you have.

This visibility is achieved through structured logs, different log levels, logging the right things, making logs searchable, keeping logs simple and concise, and categorising and grouping your logs. Most things should be logged, and challenges can arise if you don’t implement the right levels of logging.

In addition to logging the right things, it is important to remember that not everything can be logged. This includes sensitive information (e.g. secrets, PII), source code and proprietary data.
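As an illustration of structured, levelled logging that keeps sensitive values out of the output, here is a minimal sketch built on Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, the `fields` convention and the `SENSITIVE_KEYS` set are all assumptions made for this example; a real redaction list would need to reflect your own data.

```python
import json
import logging

# Illustrative only: extend this to match the sensitive fields in your own system.
SENSITIVE_KEYS = {"password", "token", "secret", "email"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs stay searchable and easy to aggregate."""

    def format(self, record: logging.LogRecord) -> str:
        # Extra business fields are passed via `extra={"fields": {...}}` (a convention of this sketch).
        fields = getattr(record, "fields", {})
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **{k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else v for k, v in fields.items()},
        }
        return json.dumps(payload, default=str)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines at INFO level and above."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    log.info("order placed", extra={"fields": {"order_id": 42, "email": "user@example.com"}})
    log.debug("not emitted at INFO level")
```

Redacting by key name is a simple safeguard rather than a complete one, since sensitive values can still appear inside free-text messages.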

Takeaways from the event

There is no set formula for creating and delivering the perfect production environment; it can depend on the project. There are, however, always things you should consider in your projects, such as 24/7 support, not deploying on Fridays, sustaining a high level of deployment, and deploying in isolation, which ensures everything can be deployed.

You should involve your whole team, including the engineers, whom you can ask for the information you need. An important takeaway is establishing a culture where the engineering team can create a product that can be released several times a day.