DevOps Decision Making

In the last decade DevOps has gone from an obscure concept to a full-blown buzzword, replete with marketing and sales strategies that have very little to do with the original spirit of the term. The consequences of this popularity are not all negative, even if the original intent is somewhat diluted. I believe that some DevOps is generally better than no DevOps, and most of the products marketed as “DevOps” tools or solutions do in fact solve a problem and can contribute to a DevOps strategy.

So how do you cut through the noise in this polluted landscape and figure out what has substance and what is hype? You can of course ask friends or colleagues you trust for recommendations, but many tools and frameworks are hyper-specific and may not work in your situation or tech stack. I’ve written previously about DevOps Culture, and while I firmly believe that any DevOps strategy must start there, tools still play an important part.

Picking The Right Tools

There are many ways that tooling may fall short of expectations, but one of the most common is when it simply isn’t designed for your use case. Trying to fit Docker and Kubernetes onto a legacy monolith designed to run on huge virtual servers (such as EC2 instances) will likely result in a subpar outcome for all parties. In a case like this, you must first re-architect the system if you wish to reap the benefits of containers.

This is the DevOps dichotomy in action. Many newer paradigms are inherently opinionated about how your application should be architected and run. This means that migrating to a containerized architecture might not start in the Ops realm at all. Instead, it may require significant Dev work before the application can even run efficiently in containers. In the case of a large monolithic app, first ask yourself: what benefits do I expect to receive from containerizing my application? Then consider the amount of Dev work required to make the application container-ready and the amount of Ops work required to set up and run a containerized platform for your apps.

If the benefits of containerization do not significantly outweigh the costs of re-architecting the application, you might consider using developer time on feature creation and double down on ops investment in a non-containerized management system and platform. In many cases this is less appealing as the technology may be older and not as exciting, but it is usually the right decision nonetheless.

It’s often better to buy a tool and pay monthly fees to let someone else do the maintenance, except in cases where use of the tool causes friction which could be eliminated with a fully custom solution. Before you build that custom solution, make sure the friction caused by the pre-built tool really does outweigh the cost of developing and maintaining your own.

If you were to evaluate a deployment tool which could do everything except one manual step in your deployment pipeline vs a custom solution which required no manual intervention, then the gating factor becomes the frequency of deployments. If you only deploy once a month then you’d likely just go with the prebuilt tool, but if you deploy multiple times per day then the custom solution starts to look much more attractive.
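As a rough sketch of that trade-off, the comparison can be reduced to a break-even calculation. Everything here is an assumption made for illustration: the function names, the idea of pricing the manual step in engineer minutes, and all of the example figures. Plug in your own numbers.

```python
from typing import Optional


def monthly_cost_of_manual_step(deploys_per_month: float,
                                minutes_per_deploy: float,
                                hourly_rate: float) -> float:
    """Cost of the one manual step the prebuilt tool leaves behind."""
    return deploys_per_month * (minutes_per_deploy / 60) * hourly_rate


def months_to_break_even(build_cost: float,
                         monthly_maintenance: float,
                         monthly_manual_cost: float) -> Optional[float]:
    """Months until a custom solution pays for itself.

    Returns None when the custom solution never pays off, i.e. its
    maintenance costs at least as much as the manual step it removes.
    """
    monthly_saving = monthly_manual_cost - monthly_maintenance
    if monthly_saving <= 0:
        return None
    return build_cost / monthly_saving


# Hypothetical figures: a 30 minute manual step, a $100/hour engineer,
# a $20,000 build and $500/month of upkeep for the custom tool.
once_a_month = monthly_cost_of_manual_step(1, 30, 100)
several_a_day = monthly_cost_of_manual_step(100, 30, 100)

print(months_to_break_even(20_000, 500, once_a_month))
print(months_to_break_even(20_000, 500, several_a_day))
```

With these made-up inputs the monthly deployer never recovers the build cost, while the team deploying around a hundred times a month recovers it within a few months. The model ignores the subscription fee of the prebuilt tool and the cost of mistakes made during the manual step, both of which you may want to add.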

Developer time is probably your most scarce and expensive resource, so start thinking about it as an investment to be deployed wisely. When you do this the right tooling starts to become obvious – use whatever your developers like that saves them the most time. Usually the costs of a tool are trivial compared to the costs of setting it up and using it. If it’s a good tool you may get that investment back with dividends, but a bad tool can be equally costly with no future return.

Thinking About Vendor Lock-in

Vendor lock-in is an important issue to consider, but in my experience most people think about it the wrong way. When choosing how to run their infrastructure, they try to avoid vendor lock-in at all costs and end up spending countless hours doing things in a more difficult, less scalable way as a result. Instead of avoiding vendor lock-in entirely, consider the probability that you will want to migrate away from a specific vendor in the first place, and the cost of doing so.

Weigh the cost of migration against the cost of using a vendor-neutral alternative. If the vendor-neutral alternative works just as well and has the same implementation cost, then by all means use it. But if it is going to require significantly greater upfront implementation work, or it doesn’t serve your needs quite as well, then you should consider your options more carefully.

There are many reasons you might want to migrate away from a specific vendor. Some of the most common ones are outlined below.

  • The vendor might go out of business or discontinue support for the product.
  • The vendor could raise prices significantly.
  • The vendor could sell to another company, and the new owner might cut off resources to the product causing a degradation of service.
  • Smaller vendors might have trouble scaling up if their product is a success, causing issues with the service.

That list is far from comprehensive, and many issues will be specific to a single vendor or situation. But hopefully it gives you some ideas to think about as you evaluate the risk of a forced vendor switch in the future. If you determine that the cost of switching vendors is high and, based on the above factors, a switch in the future is likely, then you will want to weigh open-source or vendor-neutral alternatives more carefully.

If the cost of switching is low (for example, because the vendors all use a common protocol or standard that allows a more seamless switch), then you should go with the best vendor for now and deal with changing vendors later if it becomes necessary. This decision-making process applies to almost all provider-level decisions you will make, such as log aggregation systems, cloud providers, configuration management tools, monitoring systems and metrics visualization tools.
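One way to make this reasoning concrete is to treat lock-in as an expected cost: what you pay to implement an option, plus the cost of a forced switch multiplied by how likely you think that switch is. The sketch below is only an illustration. The `Option` class, its field names and every number in the example are hypothetical, and estimating a switch probability is a judgment call rather than a measurement.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    implementation_cost: float  # upfront work to adopt this option
    switch_probability: float   # your estimate, between 0 and 1
    switch_cost: float          # cost of migrating away if forced to

    def expected_cost(self) -> float:
        """Implementation cost plus the probability-weighted switch cost."""
        return self.implementation_cost + self.switch_probability * self.switch_cost


def cheaper(a: Option, b: Option) -> Option:
    """Return whichever option has the lower expected cost."""
    return min((a, b), key=lambda option: option.expected_cost())


# Hypothetical comparison: a quick vendor-specific setup with some
# lock-in risk versus a slower vendor-neutral build with none.
vendor_specific = Option("vendor-specific", 10_000, 0.2, 50_000)
vendor_neutral = Option("vendor-neutral", 30_000, 0.0, 0)

print(cheaper(vendor_specific, vendor_neutral).name)
```

With these made-up inputs the vendor-specific option still wins despite the lock-in risk. Raise the switch probability or the switch cost far enough and the answer flips, which is the whole point of running the numbers instead of avoiding lock-in by reflex.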

Let’s say you were evaluating log aggregation, and deciding between a hosted solution and a self-hosted open-source platform such as an ELK stack. You’ll find that the switching cost to change providers is low, since you can use something like rsyslog to send logs to most providers as well as to your own ELK cluster. This means that switching would normally be as simple as changing an endpoint URL in your configuration.
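The same idea can be shown at the application level. This is not rsyslog configuration but a Python analogue using the standard library's syslog handler, with the destination kept as a single configurable value. The function names, the `"app"` logger name and the example hostnames are all assumptions made for the sketch.

```python
import logging
import logging.handlers
from typing import Tuple

DEFAULT_SYSLOG_PORT = 514  # conventional syslog port


def parse_endpoint(value: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port), defaulting the port if absent."""
    host, sep, port = value.rpartition(":")
    if not sep:
        return value, DEFAULT_SYSLOG_PORT
    return host, int(port)


def build_logger(endpoint: str) -> logging.Logger:
    """Create a logger that ships records to the given syslog endpoint.

    SysLogHandler sends over UDP unless told otherwise, so no connection
    is opened until a record is actually emitted.
    """
    host, port = parse_endpoint(endpoint)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Calling `build_logger("logs.example.com:6514")` would point the application at a hosted provider, and calling it with the address of your own cluster would point it there instead. Nothing else in the application needs to know which one is in use, which is what keeps the switching cost low.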

You should then weigh the likelihood that you’ll need to switch providers in the near future. This could be necessary due to pricing if your traffic scales way up for example. Many hosted log solutions become very expensive at large volumes.

Once you have a rough idea of the engineering cost to create and maintain your own ELK stack, the cost of switching providers and the likelihood of a change in provider being necessary you can make an informed decision and choose the best path forward.

In the case of a small, fast-moving startup with limited resources, you would probably choose to use a hosted provider for now and then switch to your own ELK stack or similar when the economies of scale and availability of engineering talent converge within your organization to make it more viable. A larger company, however, may find that the cost of a hosted solution would quickly dwarf the effort required to build their own aggregation solution, and choose to make the upfront investment instead.
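To see where that tipping point might sit, here is a deliberately simple linear model. It assumes hosted pricing scales per gigabyte and that self-hosting costs a fixed monthly amount of engineering time plus a smaller per-gigabyte infrastructure cost. Real pricing is rarely this tidy, and all names and figures below are hypothetical.

```python
from typing import Optional


def hosted_monthly_cost(gb_per_month: float, price_per_gb: float) -> float:
    """Hosted provider: pay purely by volume."""
    return gb_per_month * price_per_gb


def self_hosted_monthly_cost(gb_per_month: float,
                             fixed_engineering_cost: float,
                             infra_cost_per_gb: float) -> float:
    """Self-hosted: fixed engineering time plus infrastructure by volume."""
    return fixed_engineering_cost + gb_per_month * infra_cost_per_gb


def crossover_volume(price_per_gb: float,
                     fixed_engineering_cost: float,
                     infra_cost_per_gb: float) -> Optional[float]:
    """Monthly volume in GB above which self-hosting becomes cheaper.

    Returns None if the hosted price per GB is not higher than your own
    infrastructure cost per GB, in which case hosted always wins.
    """
    per_gb_saving = price_per_gb - infra_cost_per_gb
    if per_gb_saving <= 0:
        return None
    return fixed_engineering_cost / per_gb_saving


# Hypothetical: $2.00/GB hosted, $0.50/GB self-hosted infrastructure,
# and $8,000/month of engineering time to run the cluster.
print(crossover_volume(2.0, 8_000, 0.5))
```

A startup logging a hundred gigabytes a month sits far below the crossover in this example, while a company logging tens of terabytes sits far above it.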

This paradigm would be different in situations where the cost of switching is high. In those cases the right decision is usually to invest upfront in a solution that you can maintain for the long term.

Finding Balance

Doing it right versus doing it quickly, or pushing back on requests from developers without alienating them: these are decisions that can quickly go wrong without a strong framework for decision making to fall back on. If a tool, process or policy is implemented without developers in mind, they will find ways to circumvent it.

Strive to be a facilitator, not a gatekeeper. Focus on enabling and educating developers, and figure out how to say yes as much as possible. If you find yourself saying no to developers regularly, you have placed yourself in a gatekeeping role and must find your way back to enablement.

If you’re in a position to make decisions and recommendations for development infrastructure, think of developers as your customers. Strive to say yes, but know how to say no when it’s truly needed. If you do need to say no to a request, come back with solutions and acceptable alternatives. Otherwise you become the bottleneck in the system.

Of course saying yes doesn’t mean you throw stability, security or compliance away. This is where you must find the most important balance. This balance is a dichotomy, one which may leave you feeling like you’re between a rock and a hard place at times. However if you’ve truly done your job well and are known for aiding and enabling developers, you can spend some of that hard earned good will when the stakes justify it.

A common scenario where this might occur is with proof of concept apps that suddenly need to go to production because sales found a paying customer. From a business standpoint the app works and the customer is willing to pay for it, so there is no problem. The developer may or may not agree with this, but will likely go along with it because they get a reputation boost for shipping a working app.

You may know that the app hasn’t been properly tested or vetted for security issues, but pressure is coming from executives to deploy it ASAP. In this scenario, it’s tempting to fire back with demands or simply say that the app isn’t ready. But this response is unlikely to build good will or get you the results you want. Instead, define an objective list of criteria that an app needs to meet in order to be deployable. Then evaluate where this one falls short.

Come up with an action plan to get the app deployable and an estimate on timing. When you come back with a plan like this, you are much more likely to be taken seriously, and you’ll be seen as more helpful than someone who simply refuses to deploy the app. You’ll likely need to work with the developer of the app on some aspects of the plan, and they will be more receptive if it is concrete and actionable.
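An objective list of criteria can be as plain as a checklist that every app is held to. The sketch below is one hypothetical way to write it down. Testing and security review come from the scenario above; the other two criteria, and all of the names, are examples I have added and you would substitute your own.

```python
from typing import Dict, List

# Hypothetical deployability criteria; replace with your organization's own.
CRITERIA: List[str] = [
    "automated tests pass",
    "security review completed",
    "monitoring and alerting configured",
    "rollback procedure documented",
]


def gaps(app_status: Dict[str, bool]) -> List[str]:
    """Return the criteria this app does not yet meet, in checklist order.

    A criterion missing from app_status is treated as not met.
    """
    return [item for item in CRITERIA if not app_status.get(item, False)]


# The proof of concept from the scenario: it works, but little else is done.
proof_of_concept = {"automated tests pass": True}

for item in gaps(proof_of_concept):
    print("TODO:", item)
```

The list of gaps is the starting point for the action plan: each item gets an owner and an estimate, and the conversation shifts from whether the app can ship to what it takes to ship it.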

If all of this sounds like it would be difficult to implement in real-life scenarios, that’s because it is! Nobody said leading in a DevOps manner is easy. It’s certainly not something that will be enjoyed by everybody, but when it’s done well it acts as a force multiplier in an organization. Watching an organization’s development velocity increase in tandem with developer satisfaction, while knowing you played a small part in that success, can be a reward in itself.
