Many larger organisations use configuration managers such as Puppet and Chef to orchestrate their server environment from code, allowing for many tens, hundreds and even thousands of servers to be managed by only a handful of engineers.
Scale Vs Cognitive Load
As an increasing number of servers is provisioned by ever fewer staff, it becomes harder to understand how these servers actually fit into the bigger picture. While a given machine can be inspected for the services it runs, e.g. HAProxy, Tomcat, PostgreSQL and so forth, it is less apparent how those services are used to deliver value to the team, or how they fit into the dependency tree of larger service-oriented platforms.
In a simple model, if Machine A goes offline, the typical questions to ask are: who needs to know, and how quickly? What service breaks? And ultimately, how does this affect our customers? These are quickly followed by ‘how do we fix it?’ and, if you’re lucky, ‘how do we stop it happening in the future?’
These questions are often answered by simply sending a message to whoever is ‘on-call’ and then relying on their knowledge of the platform to recover from any arbitrary failure, usually at 4am from a laptop at home.
I believe the question you should be asking is: ‘Where are the choke-points in our architecture?’ In other words, which server(s) would have the highest service impact if they were to fail?
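One way to make the choke-point question concrete is to count, for every node, how many other nodes transitively depend on it. The following is a minimal sketch, assuming the dependencies are already known as (dependent, dependency) pairs; the node names (`shop`, `lb1`, `web1`, `db1` and so on) are hypothetical and not taken from any real estate.

```python
from collections import defaultdict, deque


def rank_choke_points(edges):
    """Rank nodes by how many other nodes transitively depend on them.

    `edges` is an iterable of (dependent, dependency) pairs, e.g.
    ("web1", "db1") means web1 is affected if db1 fails.
    """
    # Reverse index: dependency -> set of direct dependents.
    dependents = defaultdict(set)
    nodes = set()
    for dependent, dependency in edges:
        dependents[dependency].add(dependent)
        nodes.update((dependent, dependency))

    impact = {}
    for node in nodes:
        # Breadth-first walk over everything that relies on `node`.
        # The `seen` set also keeps the walk finite if the graph has cycles.
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for parent in dependents.get(current, ()):
                if parent not in seen and parent != node:
                    seen.add(parent)
                    queue.append(parent)
        impact[node] = len(seen)

    # Highest impact first; ties broken by name for stable output.
    return sorted(impact.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical estate: one product behind one load balancer.
EXAMPLE_EDGES = [
    ("shop", "lb1"),
    ("lb1", "web1"),
    ("lb1", "web2"),
    ("web1", "db1"),
    ("web2", "db1"),
]

if __name__ == "__main__":
    for name, count in rank_choke_points(EXAMPLE_EDGES):
        print(f"{name}: {count} dependent node(s)")
```

This sketch treats every dependency as a hard one, so it overstates the impact of losing a single machine in a redundant pair such as `web1` and `web2`. It is only meant to show the shape of the question, not a finished scoring model.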
My suggested approach to resolving these issues is to provide a dashboard view of the server environment that is always up to date, and that represents not only the servers in your estate but also the services operating on those servers and how they interact with each other.
The questions I’d like to be able to answer are:
- What servers are used to deliver
- Where do we make use of
- What is the impact of restarting
- How is
involved in delivering content for ?
I’m specifically not going to try to answer questions that can already be solved by existing tools like Foreman or Puppet Explorer, which are capable of returning the state of machines but not necessarily their relationships to other parts of your system.
- Use PuppetDB to provide access to core Puppet state
- Use Exported Resources to expand upon the relationships between services
- VM Hosts indicate the guests they’re running
- Guests announce the services they deliver
- Services announce their dependencies on other systems and the product they’re delivering
- Collate the data into a graph database (e.g. Neo4j / TitanDB), maintaining relationships between services
- Create a dashboard that takes the node data and lets users explore the relationships
- An initial import into Neo4j, with some Cypher queries, may be sufficient as a starting point
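The import step in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested integration: it assumes the exported resources have already been fetched from PuppetDB (for example via its `/pdb/query/v4/resources` query endpoint) as a list of dictionaries with `certname`, `type`, `title`, `exported` and `parameters` keys. The resource type `Service_map::Provides`, its `service`, `depends_on` and `product` parameters, and the node labels and relationship names in the Cypher are all hypothetical names invented for this example.

```python
# Hypothetical defined type that each node would export to announce a service.
RESOURCE_TYPE = "Service_map::Provides"


def resources_to_cypher(resources):
    """Turn PuppetDB-style resource records into (cypher, params) pairs.

    Parameterised statements are returned rather than formatted strings,
    so host and service names never need manual escaping.
    """
    statements = []
    for res in resources:
        # Only exported resources of our hypothetical type are of interest.
        if res.get("type") != RESOURCE_TYPE or not res.get("exported"):
            continue
        params = res.get("parameters", {})
        service = params.get("service", res["title"])

        # The guest announces the service it delivers.
        statements.append((
            "MERGE (h:Host {name: $host}) "
            "MERGE (s:Service {name: $service}) "
            "MERGE (h)-[:RUNS]->(s)",
            {"host": res["certname"], "service": service},
        ))
        # The service announces its dependencies on other systems.
        for dependency in params.get("depends_on", []):
            statements.append((
                "MERGE (s:Service {name: $service}) "
                "MERGE (d:Service {name: $dependency}) "
                "MERGE (s)-[:DEPENDS_ON]->(d)",
                {"service": service, "dependency": dependency},
            ))
        # The service announces the product it helps deliver.
        product = params.get("product")
        if product:
            statements.append((
                "MERGE (s:Service {name: $service}) "
                "MERGE (p:Product {name: $product}) "
                "MERGE (s)-[:DELIVERS]->(p)",
                {"service": service, "product": product},
            ))
    return statements


# Hand-written sample records in the assumed shape; not real PuppetDB output.
SAMPLE_RESOURCES = [
    {
        "certname": "web1.example.com",
        "type": "Service_map::Provides",
        "title": "tomcat",
        "exported": True,
        "parameters": {"depends_on": ["postgresql"], "product": "shop"},
    },
    {
        "certname": "web1.example.com",
        "type": "File",
        "title": "/etc/motd",
        "exported": False,
        "parameters": {},
    },
]

if __name__ == "__main__":
    for cypher, params in resources_to_cypher(SAMPLE_RESOURCES):
        print(cypher, params)
```

Using `MERGE` rather than `CREATE` means the import can be re-run after every Puppet run without duplicating nodes, which suits a dashboard that is supposed to stay up to date.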
A quick concept diagram showing a badly formed relationship between an HAProxy load balancer and three web servers, each of which hits a specific DB server; the DB servers are in a (circular) replicated cluster.
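The circular replication in that diagram is worth handling explicitly, because a naive walk of the graph would never terminate. As a rough sketch, the diagram can be written down as an adjacency mapping and searched for a cycle with a depth-first traversal. The node names (`haproxy1`, `web1`, `db1` and so on) are hypothetical stand-ins for the machines in the diagram.

```python
# The concept diagram as "node -> nodes it talks to"; names are hypothetical.
CONCEPT_GRAPH = {
    "haproxy1": ["web1", "web2", "web3"],
    "web1": ["db1"],
    "web2": ["db2"],
    "web3": ["db3"],
    # The DB servers replicate to each other in a ring.
    "db1": ["db2"],
    "db2": ["db3"],
    "db3": ["db1"],
}


def find_cycle(graph, start):
    """Return one cycle reachable from `start` as a list of nodes, or None."""
    path = []        # nodes on the current depth-first path, in order
    on_path = set()  # same nodes, for fast membership checks
    done = set()     # nodes fully explored without finding a cycle

    def visit(node):
        path.append(node)
        on_path.add(node)
        for neighbour in graph.get(node, ()):
            if neighbour in on_path:
                # Back edge: the cycle runs from `neighbour` round to itself.
                return path[path.index(neighbour):] + [neighbour]
            if neighbour not in done:
                found = visit(neighbour)
                if found:
                    return found
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


if __name__ == "__main__":
    print(find_cycle(CONCEPT_GRAPH, "haproxy1"))
```

Whether a cycle like this counts as an error or as a legitimate property of a replicated cluster is a modelling decision. The sketch only reports it so that a dashboard could flag or collapse it.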