Self-provisioning network telemetry probes. I had almost forgotten about them.

Federico Olivieri
6 min read · Apr 28, 2020
Network Telemetry presentation at Cloud Expo 2019

Yes, that happens when you write (probably by coincidence…) a good piece of Python crap that just works: it does its job without complaining about errors and exceptions. I deployed it two years ago and almost forgot about its existence in our infrastructure…until now. But let's go in order and tell the whole story from the beginning.

As every network engineer on this globe has experienced, whenever there is some kind of slow DNS response or HTTP error in some application, the first thing to be blamed (try to guess…) is always (and always will be, no matter what) the network. For that reason, my manager assigned me my first project as a Network Automation Engineer! (old and sweet memories…). The idea was to build some kind of probe that could monitor a network path from Layer 1 to Layer 4, from one end-point to another, no matter whether the end-point was on public or private cloud. The most important thing was to provide some sort of self-service, self-provisioning way to configure and deploy probes, so that every team could run its own probe, on its own application, targeting whatever it liked. For example (any resemblance to real persons or facts is purely coincidental): a SysAdmin who wants to monitor the path between his/her DNS server and some root servers, or a DevOps who wants to monitor the network path between some applications on private cloud and a DB on public cloud.

An accessible UI for probes provisioning.

The first challenge was to provide an accessible user entry point for probe provisioning, that is, an easy way to define the probe's IP or FQDN targets. The most obvious option for me was a YAML file: a simple key/value pair where a user could define a key as the environment where the probe runs, and a value as a list of IP or FQDN targets. Nobody needs any knowledge of Python or coding, as YAML is quite self-explanatory. Can you guess the meaning of the files below?

YAML file for ICMP and TCP targets
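The real files are shown in the screenshot above. As a rough sketch of how such a file could look, with a similar file per probe type (ICMP, TCP), and with environment names and targets that are entirely made up:

# var/targets.yaml — hypothetical example: key = environment, value = list of targets
aws_production:
  - 8.8.8.8
  - dns.example.internal
private_cloud:
  - api.example.com
  - db.example.internal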

The Python bit: main code and plugins.

Dreaming that my telemetry idea could be a success and that people across the teams would want more features on their probes, I decided to design the code in a main-and-plugin fashion.

The main code is the engine: it runs a thread for each target (where each thread executes a command such as ping or a TCP SYN) and pushes the data to the DB (more on that later). The plugin is the bit of code that runs the actual command, parses the output, and builds the JSON body for the DB API call. This is how the folder tree looks with main and plugins:

.
├── classes
│ ├── __init__.py
│ ├── influx_body.py
│ ├── ping_alpine.py
│ └── ping_alpine_parser.py
├── network_telemetry_ping.py
└── var
└── targets.yaml

The main (in the above example network_telemetry_ping.py) is isolated from the plugins (under the classes folder), and each plugin is imported into the main:

from classes.ping_alpine import Ping                       

This kind of design made the code scalable and capable of fulfilling new requirements without major changes: new command == new plugin, as simple as that. By the way, that kind of code design is something I have stuck with since that day, and so far it seems to be a winning option.
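To give an idea of the pattern, here is a minimal, hypothetical sketch of a plugin and its main (the real plugin parses the full ping output, and the real main pushes the results to InfluxDB):

# classes/ping_alpine.py — simplified, hypothetical plugin sketch
import subprocess

class Ping:
    """Run ping against a target and return parsed results."""

    def __init__(self, target, count=5):
        self.target = target
        self.count = count

    def run(self):
        # Execute the ping command and capture its output.
        result = subprocess.run(
            ['ping', '-c', str(self.count), self.target],
            capture_output=True, text=True)
        return self.parse(result.stdout)

    def parse(self, output):
        # Pull packet loss and average RTT out of the raw ping output.
        stats = {'target': self.target, 'packet_loss': None, 'rtt_avg': None}
        for line in output.splitlines():
            if 'packet loss' in line:
                stats['packet_loss'] = line.split(',')[2].strip().split()[0]
            if 'min/avg/max' in line:
                stats['rtt_avg'] = line.split('=')[1].split('/')[1]
        return stats

# network_telemetry_ping.py — simplified, hypothetical main sketch
import threading
import yaml
from classes.ping_alpine import Ping

def probe(target):
    stats = Ping(target).run()
    # In the real code the stats become a JSON body pushed to the DB.
    print(stats)

with open('var/targets.yaml') as f:
    targets = yaml.safe_load(f)

# One thread per target, across all environments defined in the YAML file.
threads = [threading.Thread(target=probe, args=(t,))
           for env in targets.values() for t in env]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()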

The database. What to choose?

A very short introduction on this: DBs have always been the pain of my life. I've never really got along very well with them. End of introduction.

That said, I had to find a good TSDB, easy to use and work with. Googling around, I found InfluxDB (from the InfluxData family), which was exactly what I was looking for: easy to deploy with docker, a REST API to interact with, and SQL-like queries. So I started a couple of containers running InfluxDB, configured some proper retention policies, and I was good to go! My back end was ready in a couple of hours (…well, let's be honest: because of my repulsion for DBs, it took me more than a couple of hours).
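As a rough sketch of that back-end bootstrap, assuming the influxdb Python client and made-up database and retention policy names:

# Hypothetical back-end setup with the influxdb Python client.
from influxdb import InfluxDBClient

# Host and port are assumptions for a local InfluxDB container.
client = InfluxDBClient(host='localhost', port=8086)

# Create the database and a retention policy so old probe data expires.
client.create_database('telemetry')
client.create_retention_policy('two_weeks', '14d', '1',
                               database='telemetry', default=True)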

InfluxDB JSON body for API call — TCP
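The real body is in the screenshot above; a minimal sketch of what such a body might look like for a TCP probe (measurement, tag and field names are assumptions), written with the same client:

from influxdb import InfluxDBClient

# Hypothetical JSON body for one TCP probe result.
json_body = [{
    'measurement': 'tcp_probe',
    'tags': {'probe': 'probe-aws-eu-west', 'target': 'api.example.com', 'port': '443'},
    'fields': {'connect_time_ms': 12.4, 'success': 1},
}]

# Write it through the REST API (connection details are assumptions).
client = InfluxDBClient(host='localhost', port=8086, database='telemetry')
client.write_points(json_body)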

The front-end: Grafana.

Unsurprisingly, I found out that our company already had an instance of Grafana running, which was also widely used by teams across different countries. So the only thing I needed to do was hook the DBs up to Grafana, and all the data would be available to be displayed in nice graphs.
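Hooking the DB in can be done by hand in the Grafana UI; as a provisioning-file sketch (datasource name, URL and database are assumptions):

# provisioning/datasources/influxdb.yaml — hypothetical Grafana datasource provisioning
apiVersion: 1
datasources:
  - name: NetworkTelemetry
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    database: telemetry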

ICMP graph examples in Grafana

The glue: CI/CD for self-provisioning

Great! I had all the pieces I needed and they were working as they were supposed to. But a problem remained: how to ship the code wherever a user wanted it shipped? And most importantly, how could a user do it himself, without bothering me to deploy this probe here and there, targeting this or that?

The first question could be answered with a Docker container: all I needed to do was build a small docker container running the code. That container became what was later called a network probe. With a simple docker run (or, even better, docker-compose) the probe could be up and running in a few seconds (it is unnecessary to say that there is not an instance in private or public cloud that does not run docker…).
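A minimal docker-compose sketch for such a probe (service name and build context are assumptions):

# docker-compose.yaml — hypothetical probe definition
version: '3'
services:
  network-probe:
    build: .
    restart: always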

The second question (how to self-provision and self-deploy a probe) could be answered thanks to GitLab and its CI/CD integration based on GitLab Runner. Even though I faced some limitations (addressed later in the latest GitLab versions), it was good enough for my purpose. So I built a pipeline where, at every git push, new docker probes were built with the latest targets imported from the YAML file and then deployed wherever required. Click and forget. The remaining bit was to update the Grafana graphs manually (even though that could be automated by pushing a JSON config to the Grafana API).
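A rough sketch of such a pipeline (stage names, image tag and registry are made up, and a runner with docker access is assumed):

# .gitlab-ci.yml — hypothetical build-and-deploy pipeline
stages:
  - build
  - deploy

build_probe:
  stage: build
  script:
    - docker build -t registry.example.com/network-probe:latest .
    - docker push registry.example.com/network-probe:latest

deploy_probe:
  stage: deploy
  script:
    - docker pull registry.example.com/network-probe:latest
    - docker rm -f network-probe || true
    - docker run -d --name network-probe registry.example.com/network-probe:latest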

CI/CD workflow for probes deployment

And that was it. End of story.

Conclusions

Oddly enough, I have never received complaints over the past years about probes misbehaving. I later found room for improvement in the Python code as well as in the CI/CD process, but everything was (and still is) working so smoothly that I've almost forgotten about this project.

For fun, I also wrote the same code in Go and explored the power of goroutines. Code available here.

I also had the pleasure of presenting my work at CloudExpo 2019 where, for the occasion, I put all the work together in some slides (available here) and published a demo on git (available here).

Network Telemetry presentation at Cloud Expo 2019
