Sunday, June 21, 2015

Monitoring and Visibility

I have been thinking a lot recently about the role of monitoring in DevOps versus traditional operations.

Monitoring has traditionally been the purview of Operations. To the extent that Development was involved in monitoring, it was to provide the application metrics that could indicate a problem. These would then be sucked into a monitoring platform managed by operations, with appropriate alert thresholds set up and the necessary run books for how to respond.

With the evolution of DevOps, production support is often a joint responsibility. Many tools have emerged that provide more insight into the application from a user or code perspective, and these metrics can be integrated into the monitoring and alerting platform. It is not uncommon to have different escalation policies for different types of alerts, with some going to developers first, others to ops first, and many passing between the two.
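As a rough sketch of the idea, routing rules like these might be expressed as below. The alert categories and team names are hypothetical, and in practice this logic would live in the alerting platform's own configuration rather than in application code.

```python
# Hypothetical sketch: map alert types to an ordered escalation chain.
# Categories and team names are illustrative only.
ESCALATION_POLICIES = {
    "application_error": ["dev-oncall", "ops-oncall"],   # developers first
    "host_resource":     ["ops-oncall", "dev-oncall"],   # ops first
    "response_time":     ["dev-oncall", "ops-oncall"],
}

def escalation_chain(alert_type):
    """Return the ordered list of teams to page for a given alert type."""
    return ESCALATION_POLICIES.get(alert_type, ["ops-oncall"])

print(escalation_chain("application_error"))  # ['dev-oncall', 'ops-oncall']
```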

This collaborative effort can lead to much quicker resolution of production issues, but it is still reactive. How can the monitoring tools we implement help teams proactively prevent production issues?

Visibility is often talked about as an integral part of DevOps. Tools such as Ansible or Puppet pride themselves on giving a clear view into how systems are managed. Code coverage tools provide visibility into testing. Continuous integration systems provide visibility into build stability. All of this helps to increase confidence, among those working together to build and release systems, that those systems will meet their requirements: functionality, performance, security and stability.

Monitoring Tools should be seen as an integral part of this effort to provide visibility. They should present information in a simple visual format that can inform the development process.

Developers working on a new feature, for example, can check system metrics to watch for issues such as a spike in CPU or memory usage that may indicate code inefficiencies not covered by unit tests. User simulations can surface problems with response times deep within a web application. Are these the result of new code, or are we just seeing them now because we are looking more closely? Either way, it is probably something we want to fix before releasing to production.
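As a minimal sketch of that kind of check, a developer could sample process CPU and memory around a new code path while working on it. This assumes the third-party psutil library, and exercise_new_feature() is a placeholder for whatever the feature branch actually does.

```python
# Minimal sketch: sample process CPU and memory around a new code path.
# Requires the psutil package; exercise_new_feature() is a stand-in.
import psutil

def measure(func):
    proc = psutil.Process()
    proc.cpu_percent(interval=None)           # prime the CPU counter
    before_rss = proc.memory_info().rss
    func()
    cpu = proc.cpu_percent(interval=None)     # CPU % since the priming call
    delta_rss = proc.memory_info().rss - before_rss
    print(f"CPU: {cpu:.1f}%  memory delta: {delta_rss / 1024 / 1024:.1f} MiB")

def exercise_new_feature():
    sum(i * i for i in range(1_000_000))      # placeholder for the new code

measure(exercise_new_feature)
```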

Taking benchmark snapshots with each release is important. Measuring against these benchmarks can be incorporated into unit testing. Once we are confident we have a good understanding of an application's optimal performance, we can break the build if code changes degrade it.
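One way this might look is a performance regression test wired into the test suite. The sketch below assumes a pytest-style test, a stored baseline.json produced from the previous release, and a hypothetical handle_request() function; the 20% tolerance is an arbitrary illustration.

```python
# Sketch of a performance regression test (pytest style).
# baseline.json and handle_request() are hypothetical stand-ins; the
# benchmark snapshot from the last release supplies the expected timing.
import json
import time

def handle_request():
    return sum(i for i in range(100_000))     # placeholder for real code

def test_handle_request_performance():
    with open("baseline.json") as f:
        baseline = json.load(f)["handle_request_seconds"]

    start = time.perf_counter()
    for _ in range(100):
        handle_request()
    elapsed = (time.perf_counter() - start) / 100

    # Fail the build if the new code is more than 20% slower than baseline.
    assert elapsed <= baseline * 1.2, (
        f"performance regression: {elapsed:.4f}s vs baseline {baseline:.4f}s"
    )
```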

Thus our monitoring tools move from something that becomes important at the end of the process -- when code is deployed to production -- to an integral part of the development process itself, in true DevOps fashion.
