How do you decide when something is ready for production?

I am a Site Reliability Engineer with nearly 5 years of experience. I talk about Linux, Automation, Networking, and anything else related to tech and CS.
When an SRE team takes a service to production, it takes over responsibility for managing that service, so it wants every operational aspect clarified up front to ease the journey. Different organizations involve the SRE team at different points, some early in development and some at a later stage. In my opinion, it is better to involve them early to reduce friction during go-live. Regardless of when the SREs are involved, here are a few things to consider before taking a service live (or into production):
Reliability: Discuss the expected uptime and performance with the service's stakeholders (developers, product managers, sales, etc.). The expected uptime feeds into the SLAs, and the expected performance feeds into the SLOs.
Visibility: A discussion about logs, metrics, and monitoring is imperative, because these are what make it possible to measure how the service is behaving.
Performance: A baseline of expectations must be set and met when taking a service into production, so that the room for improvement is defined early in the journey. The baseline also helps when setting up alerts that fire if performance tanks.
Capacity Planning: How much capacity would be required to meet the performance expectations or traffic estimates? Can we refer to any other service to decide the capacity for this one? Is it within our budget? These questions simplify a lot of discussion around capacity.
Emergency Response: Through books and experience, I learned the cardinal rule of being an SRE: no matter what you do, something will eventually go wrong and your service will fail. So it is best to decide who the firefighters are in advance. A couple of developers and SREs on call, with good training and documentation, make the emergency response process much more manageable. Without an emergency response plan, you should not consider taking a service to production.
Change Management: Changes directly impact your SLA, so you need to define how frequently new changes will be pushed and how the service will be handled while they roll out. There are various approaches to change management, such as canary releases and blue-green deployments, that are worth looking up.
Security: Depending on the type of business or service, certain security standards might need to be met. Make sure those requirements are clear before go-live.
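Two of the items above, Reliability and Capacity Planning, come down to simple arithmetic once the stakeholders agree on numbers. The sketch below is only an illustration: the function names, the 30-day window, and the 30% headroom are my own assumptions, not values from any standard, and the traffic figures are made up.

```python
import math


def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window.

    The 30-day default window is an assumption; use whatever period
    your SLA is actually measured over.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve peak traffic with spare headroom.

    headroom=0.3 (30% extra capacity) is an illustrative default,
    not a recommendation.
    """
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


if __name__ == "__main__":
    # Compare how much downtime each candidate target leaves you.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {minutes:.1f} min of downtime")

    # Hypothetical traffic estimate: 1200 requests/s at peak,
    # with one instance handling roughly 150 requests/s.
    print("instances needed:", instances_needed(1200, 150))
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is a useful number to bring to the discussion with stakeholders. The per-instance throughput would normally come from a load test or from a comparable service you already run.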
This is not an exhaustive checklist of everything you should do before taking a service live, but it is a good baseline that I was able to chalk out from my reading and personal experience.




