SRE has found that roughly 70% of outages are due to changes in a live system.
Problem
Knowing this, there is no need to look any further for the reasons why SRE teams (or production teams, or whatever the team that gets called by angry customers is named) are so reluctant to change. If that is not enough, just remember that their objectives are certainly based on the reliability of the services they maintain.
Some of my favorite quotes about software engineering.
Architecture
If builders built houses the way programmers built programs, the first woodpecker to come along would destroy civilization. – Gerald Weinberg
There are only two hard things in computer science: cache invalidation and naming things. – Phil Karlton
Development
Don’t comment bad code – rewrite it. – B. W. Kernighan & P. J. Plauger
Refactoring is often compared to gardening; it is never finished.
I explain here how to interact with AWS either with the CLI (Command Line Interface) or with an IT automation tool: Ansible. Ansible is not the first tool that comes to mind for AWS (Serverless, Terraform or the built-in CloudFormation make more sense); however, Ansible can be useful if you just want to configure a few EC2 instances, especially if you already have an Ansible script somewhere around.
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
– Principles of Chaos
This book is a bible for any professional who wants to deploy a solution in production, which is normally the goal, rather than building throwaway POCs. It is a recognized reference since it helped popularize certain patterns such as the circuit breaker, and it is at the top of all the must-read lists in the domain. It’s full of good advice and feedback, since Michael T. Nygard has worked in the field in question, now called operations (or even SRE), on critical applications: mainly, but not only, big e-commerce sites.
I’m using Pelican for another blog dedicated to books (no one is perfect). For several needs, and mainly because I’m a nerd, I have developed several plugins. And I have discovered that the Pelican plugin mechanism is based on a small framework called Blinker.
Blinker provides fast & simple object-to-object and broadcast signaling for Python objects.
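The signaling idea can be illustrated without the library itself. What follows is a minimal standard-library sketch of a named signal with connected receivers; the class `MiniSignal`, the signal name `article_ready` and the receiver `count_words` are hypothetical, and this is neither Blinker's implementation nor its API.

```python
class MiniSignal:
    """Hypothetical, simplified stand-in for a named signal (not Blinker's API)."""

    def __init__(self, name):
        self.name = name
        self._receivers = []

    def connect(self, receiver):
        # Register a callable; returning it lets connect() be used as a decorator.
        self._receivers.append(receiver)
        return receiver

    def send(self, sender, **kwargs):
        # Broadcast to every receiver and collect (receiver, result) pairs.
        return [(receiver, receiver(sender, **kwargs)) for receiver in self._receivers]


article_ready = MiniSignal("article_ready")  # hypothetical signal name


@article_ready.connect
def count_words(sender, text=""):
    # A receiver gets the sender plus any keyword arguments passed to send().
    return len(text.split())


results = article_ready.send("my-plugin", text="hello pelican plugins")
word_counts = [result for _, result in results]
```

The sender does not need to know which receivers exist, which is what makes this pattern convenient for a plugin mechanism.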
The term “four golden signals” was introduced by the Google SRE team in the book Site Reliability Engineering1. The main definitions presented below are borrowed from this book.
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
1 - Latency (Performance)
The time it takes to service a request, with a focus on distinguishing between the latency of successful requests and the latency of failed requests.
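The distinction between successful and failed requests can be sketched as follows. This is only an illustration: the request log is made up, and treating HTTP 5xx status codes as failures is an assumption, not something the book prescribes.

```python
from statistics import mean

# Hypothetical request log: (duration in milliseconds, HTTP status code).
requests = [(120, 200), (95, 200), (30, 500), (210, 200), (15, 503)]


def mean_latency_by_outcome(records):
    """Return the mean latency of successful and failed requests separately."""
    # Assumption: a 5xx status code marks a failed request.
    ok = [ms for ms, status in records if status < 500]
    failed = [ms for ms, status in records if status >= 500]
    return {
        "success_ms": mean(ok) if ok else None,
        "failure_ms": mean(failed) if failed else None,
    }


latency = mean_latency_by_outcome(requests)
```

In this sample the failed requests return much faster than the successful ones, so a single average over all requests would make the service look faster than it really is.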
This book is simple and well organised. It addresses the key topics you need to cover if you want to build, deploy and operate large-scale applications. I am not inventing anything here; these are the five sections of the book:
Availability: learn techniques for building highly available applications, and for tracking and improving availability going forward
Risk management: identify, mitigate, and manage risks in your application, test your recovery/disaster plans, and build out systems that contain fewer risks
Services and microservices: understand the value of services for building complicated applications that need to operate at higher scale
Scaling applications: assign services to specific teams, label the criticalness of each service, and devise failure scenarios and recovery plans
Cloud services: understand the structure of cloud-based services, resource allocation, and service distribution
Lee Atchison is very good at providing an overview of all these subjects.
A circuit breaker is a well-known piece of technology used in almost every house. According to Wikipedia, it is
designed to protect an electrical circuit from damage caused by excess current, typically resulting from an overload or short circuit. Its basic function is to interrupt current flow after a fault is detected. Unlike a fuse, which operates once and then must be replaced, a circuit breaker can be reset (either manually or automatically) to resume normal operation.
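The software pattern of the same name borrows this behaviour: interrupt calls after repeated faults, then allow a reset. Below is a minimal sketch under stated assumptions; the class names, the default thresholds and the injected `clock` parameter are all hypothetical choices made for illustration, not a reference implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds before a trial call is allowed
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open, call rejected")
            # Reset delay elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

Passing the clock as a parameter keeps the sketch easy to exercise without waiting for real time to elapse.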
Static type checking could be one of the next features to be included in the Python standard library. One of the main reasons is that the lack of static typing is sometimes convenient (you do not want to bother with it when you write small scripts), but as your code base grows you would sometimes be happy to check that everything is fine without writing a test case for each line of code.
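As a small illustration of what type annotations look like in practice, here is a sketch using the standard `typing` module; the function `mean_length` is a made-up example.

```python
from typing import List


def mean_length(words: List[str]) -> float:
    """Average length of the given words; the annotations document the contract."""
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


average = mean_length(["static", "typing"])
# Annotations are not enforced at runtime; an external checker such as mypy
# would report a call like mean_length(42) before the code is run.
```

The annotations cost nothing in a small script, since they can simply be left out, and they start paying off once a checker is run over a larger code base.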
A categorical variable, as the name suggests, is used to represent categories or labels. For instance, a categorical variable could represent major cities in the world, the four seasons in a year, or the industry (oil, travel, technology) of a company. […] The categories of a categorical variable are usually not numeric. [..] Thus, an encoding method is needed to turn these non-numeric categories into numbers.1
Both Pandas and scikit-learn provide encoders to deal with categorical variables.
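To make the idea of an encoding concrete, here is a dependency-free sketch of one common scheme, one-hot encoding. It is only an illustration of the principle, written in plain Python rather than with the Pandas or scikit-learn encoders, and the helper name `one_hot` is hypothetical.

```python
def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns sorted by name)."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


seasons = ["winter", "spring", "summer", "spring"]
columns, encoded = one_hot(seasons)
# columns lists the categories found; each row of encoded has exactly one 1.
```

Each non-numeric label becomes a vector of numbers, which is the form most learning algorithms expect.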
Erik Meijer1, in one of his lectures2 of the Functional Program Design in Scala course, uses such a terrible example that I wanted to write it down. He used it to introduce reactive programming in Scala through the use of a Future[T] monad. But this is not the topic I want to talk about here. I just want to highlight, by reusing his example, the concept of scaling for humans.
I use Markdown as much as I can, mainly to write blog posts and documentation. There are many great editors for Markdown, at least on Mac, but most of them are not free to use. Moreover, Ulysses has just begun a move, which others could follow, to a subscription model (~$5/month). Finally, I want to avoid vendor lock-in and being charged $40/year for plain text editing in a format designed by John Gruber and Aaron Swartz to be usable from anywhere.
This article explains how to quickly put in place basic monitoring of the Hadoop YARN resource allocation system through the ResourceManager REST APIs1. The same information is also available in metrics collectors like Ambari Metrics (AMS), but the YARN API is really easy to use and is available everywhere, most of the time without requiring authentication. One last word: triggers (thresholds) on these metrics make an efficient source to integrate into a global monitoring and alerting system, like Nagios or Zabbix, since they are not available out of the box in the standard Ambari alerting system2.
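A minimal sketch of such a check, using the cluster metrics endpoint (`/ws/v1/cluster/metrics`) of the ResourceManager REST API. The function names, the threshold values and the sample payload are assumptions made for illustration; `rm_url` stands for the address of your own ResourceManager.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch cluster metrics from the ResourceManager REST API."""
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics", timeout=10) as response:
        return json.load(response)["clusterMetrics"]


def check_thresholds(metrics, max_memory_ratio=0.9, max_pending_apps=10):
    """Return a list of alert messages; the default thresholds are illustrative."""
    alerts = []
    total_mb = metrics["totalMB"]
    if total_mb and metrics["allocatedMB"] / total_mb > max_memory_ratio:
        alerts.append("memory allocation above threshold")
    if metrics["appsPending"] > max_pending_apps:
        alerts.append("too many pending applications")
    return alerts


# Sample payload with a subset of the fields, used here instead of a live call.
sample = {"totalMB": 8192, "allocatedMB": 7800, "appsPending": 2}
alerts = check_thresholds(sample)
```

Against a live cluster, `check_thresholds(fetch_cluster_metrics(rm_url))` would be the call to wrap in a Nagios or Zabbix check.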
A Virtual Environment is a tool to keep the dependencies required by different projects in separate places, by creating virtual Python environments for them. It solves the “Project X depends on version 1.x but, Project Y needs 4.x” dilemma, and keeps your global site-packages directory clean and manageable. – source1
They can also be used to isolate very different things, such as different versions of Python. Virtual Environments can be managed with the virtualenv tool, installed through the Python package manager pip, but when you are using the popular Anaconda distribution it is necessary to use the conda package manager that comes with it.
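For illustration only, an environment can also be created programmatically with the `venv` module from the Python standard library. This is a different tool from virtualenv and conda, and the directory name used here is made up.

```python
import tempfile
import venv
from pathlib import Path

# Create an isolated environment in a throwaway directory (path is illustrative).
target = Path(tempfile.mkdtemp()) / "project-x-env"
venv.create(target, with_pip=False)  # with_pip=True would also bootstrap pip

# Every environment carries a pyvenv.cfg file describing its base interpreter.
config_exists = (target / "pyvenv.cfg").is_file()
```

Each project gets its own directory of this kind, so packages installed for Project X never touch the ones installed for Project Y.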
While reading Data Analysis with Open Source Tools1, one of my favorite books on this topic, I was surprised to discover that rounding numbers for display was described by a rule bearing the name of Andrew S. C. Ehrenberg.
It is obviously pointless to report or quote results to more digits than is warranted. In fact, it is misleading or at the very least unhelpful, because it fails to communicate to the reader another important aspect of the result–namely its reliability!
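As a rough illustration of trimming a result to the digits that matter, here is a sketch that rounds a number to a given count of significant digits. It is a generic helper with a hypothetical name, not a statement of Ehrenberg's rule itself, and the default of two digits is an assumption.

```python
from math import floor, log10


def round_significant(value, digits=2):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0
    # Position of the leading digit, e.g. 5 for 123456 and -2 for 0.012345.
    magnitude = floor(log10(abs(value)))
    return round(value, digits - 1 - magnitude)


rounded = [round_significant(x) for x in (123456, 3.14159, 0.012345)]
```

The same helper works across very different magnitudes because the rounding position is derived from the number itself.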
Definition
An outlier is a data point or observation whose value is quite different from the others in the dataset being analyzed.1
It is an important part of the analysis to identify outliers and to use appropriate techniques to take them into account. But unfortunately 😔
There is no absolute agreement among statisticians about how to define outliers […]1
So what can be done? Fortunately 😏
The book Think like a programmer1 is interesting because it does not focus only on coding. One of the most interesting chapters is about general problem-solving techniques. I found it so good that I wanted to write down the rules as a reminder of this useful advice. The cherry on the cake: they can be applied to almost every problem.
Always have a plan
Restate the problem
Divide the problem
Start with what you know
Reduce the problem
Look for analogies
Experiment
Don’t get frustrated
V.
Introduction
Martin Odersky1, in his Functional Programming in Scala course, illustrates higher-order functions2 with a simple example in Scala. Scala is a modern functional programming language where functions are first-class citizens, so they can be passed as parameters and returned as results, like any other value.
// Simple sum function using recursion to apply an operation to the integers between a and b
def sum(f: Int => Int, a: Int, b: Int): Int =
  if (a > b) 0 else f(a) + sum(f, a + 1, b)

// Using anonymous functions (addicted to syntactic sugar) to define
// the sum of the integers between a and b
def sumInts(a: Int, b: Int) = sum(x => x, a, b)

// The sum of the cubes of the integers between a and b
def sumCubes(a: Int, b: Int) = sum(x => x * x * x, a, b)

> sumInts(1, 10)
Int = 55
> sumCubes(1, 10)
Int = 3025

I love Scala's lean syntax and its syntactic sugar, but Python and R, my old friends, also support the functional paradigm among others.
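As a point of comparison, the same higher-order function can be sketched in Python. The names `sum_over`, `sum_ints` and `sum_cubes` are chosen here for illustration (`sum_over` avoids shadowing Python's built-in `sum`).

```python
def sum_over(f, a, b):
    """Sum f(x) for the integers a..b inclusive, recursively like the Scala version."""
    return 0 if a > b else f(a) + sum_over(f, a + 1, b)


def sum_ints(a, b):
    # The identity function passed as an anonymous lambda.
    return sum_over(lambda x: x, a, b)


def sum_cubes(a, b):
    return sum_over(lambda x: x ** 3, a, b)
```

Calling `sum_ints(1, 10)` and `sum_cubes(1, 10)` gives 55 and 3025, the same results as the Scala session. Note that Python limits recursion depth, so this recursive form only suits small ranges.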
In my previous article on DevOps, I talked about an important tool in the process of enabling cooperative learning: shared stories. But in practice, what does it mean? Here is a very simple proposal that can be applied almost anywhere and that does not require any tool, except a ticketing tool.
During the team meeting, each member of the team takes some time to talk about a mistake, a problem or a failure they have made or have to solve.