Why the government should take a page from Google's IT playbook
COMMENTARY | Site reliability engineering could be the answer to improving the availability and usability of government digital services.
Government agencies face substantial obstacles in keeping their IT systems reliable and accessible, both for citizens using public services and for employees conducting government operations. Time and again, we see government websites and IT systems fail to keep up with demand or fail to meet users’ expectations of reliability and availability. As more government services move to digital platforms, the need for efficient, reliable, resilient and user-friendly systems becomes increasingly pressing.
While agencies have made strides in modernizing legacy application systems and the accompanying IT infrastructure through agile development methods and DevSecOps automations, reliability optimizations have not kept pace. A notable gap remains in how IT operations are managed once systems are deployed: post-production tasks like system administration, service desks and other service management functions still involve a lot of “toil.” As a result, the government spends disproportionately more on the operations and maintenance portion of IT budgets, almost three times more than on the development, modernization and enhancement portion in the proposed fiscal year 2024 budget.
It is time for a paradigm shift. Agencies must strive to deliver highly reliable and available services by preventing incidents before they happen and responding effectively to those that do occur. This is where Site Reliability Engineering comes in.
Google came up with the concept of Site Reliability Engineering in 2003 to better manage the operations of its massive, globally distributed infrastructure and to minimize downtime and latency issues. Now, the SRE approach and its underlying practices have become the modus operandi for big technology and private sector companies. These organizations have upskilled “traditional” system administrators into a smaller number of more effective Site Reliability Engineers who build and run large-scale, distributed, fault-tolerant systems.
SRE is essentially a set of practices and principles that result from taking an engineering approach to IT operations. Unlike typical system administrators, SREs are not involved only during operations; they also play a critical role during development. SREs write software alongside developers to automate tasks like upgrades, backups, load balancing, incident response and other operational management functions. During the operations phase, SREs lead tasks such as capacity planning, monitoring, emergency response and change management. They approach these activities with a focus on making systems more scalable, reliable and resilient. With the SRE approach, operations teams do not need to grow linearly as systems grow, because much of the toil is eliminated upfront and smart automation is inherent in the approach.
Drawing from the success observed in the private sector, government agencies can gain valuable insights and advantages by integrating SREs within their development teams. This integration enables careful assessment of application components, design, and new releases to preempt potential reliability concerns. By incorporating SREs into the product release train, agencies can effectively uphold systemwide service level objectives.
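One common way SRE teams make a service level objective actionable is an error budget: the amount of downtime the objective still permits over a given window. The short Python sketch below illustrates the arithmetic only. The 99.9 percent target, the 30-day window and the function names are assumptions chosen for illustration, not figures or tooling from any agency.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability target allows per window.

    slo_target is a fraction, e.g. 0.999 for "99.9 percent available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be greater than 0 and at most 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def release_allowed(slo_target: float, window_days: int, downtime_minutes: float) -> bool:
    """Hypothetical release gate: allow a release only while budget remains."""
    remaining = error_budget_minutes(slo_target, window_days) - downtime_minutes
    return remaining > 0


if __name__ == "__main__":
    # Assumed example: a 99.9 percent objective measured over 30 days.
    budget = error_budget_minutes(0.999, 30)
    print(f"Allowed downtime: {budget:.1f} minutes")
    print("Release allowed after 10 minutes down:", release_allowed(0.999, 30, 10))
    print("Release allowed after 50 minutes down:", release_allowed(0.999, 30, 50))
```

Under these assumptions, a 99.9 percent objective leaves roughly 43 minutes of downtime per 30 days. A team that has already used more than that would hold new releases and focus on reliability work until the window recovers.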
While small pockets of government are beginning to adopt SRE principles and practices, there is a tremendous opportunity to accelerate that adoption and fully embrace a substantially better approach to system design and management. Shifting to SRE to improve the reliability, scalability and performance of government digital systems, and with them the services citizens receive, demands increased attention and executive commitment. To implement SRE practices successfully, government agencies must prioritize cultivating a culture of reliability, investing in training and skill development, collaborating with industry experts, incorporating SRE into the concept of operations and making it a requirement in procurements.