Make It Small

“Make it large” may be the mantra these days, but when it comes to releasing software, think the opposite: make it small. The secret to a successful release is breaking it into numerous small deployments. This serves a dual purpose: it minimizes risk, and it gives you enough leeway to fix bugs before they negatively impact end users.

Never wait for a feature to be complete before deploying it. Break it into multiple smaller chunks and keep deploying these to production, even though no one will be using them yet. Smaller deployments always win over one big-bang deployment. The longer the code runs in production, the more data points you have to analyze and act on.

If you can test with real users without them knowing, there is nothing like it. If you put your mind to it, you will almost always figure out a way to do this. As an example, let us say you are building a chat feature for your application. You can stealthily release it without informing your users and send dummy messages on their behalf. The users do not see any chat-related functionality, but behind the scenes, you create dummy messages and send them to one another. Apparently, this is how Facebook tested their chat before launch; it lets you exercise your feature realistically, in the ways it will ultimately be used.

Another trick is to log the action without actually carrying out the intended side effect. Say, for example, you are introducing an API rate limiter. As a first step, start logging whenever an API call hits the rate-limit threshold. During this phase, do not block the calls; the idea is to emulate the result. This lets you test the feature with real traffic and fix any bugs before it starts negatively affecting your API consumers. Once you are confident of your approach, you can roll out the version that actually blocks offending callers.
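A minimal sketch of this log-only phase, assuming a naive fixed-window counter (window resets and real logging are omitted, and all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a rate limiter with a log-only first phase. Fixed-window
// resets, per-window keys, and real log shipping are omitted for brevity.
class RateLimiter {
    private final int limitPerWindow;
    private final boolean logOnly; // phase 1: true, phase 2: false

    private final Map<String, Integer> counts = new HashMap<>();

    RateLimiter(int limitPerWindow, boolean logOnly) {
        this.limitPerWindow = limitPerWindow;
        this.logOnly = logOnly;
    }

    // Returns true if the call should be let through.
    boolean allow(String apiKey) {
        int count = counts.merge(apiKey, 1, Integer::sum);
        if (count > limitPerWindow) {
            System.out.println("rate-limit breach: " + apiKey + " count=" + count);
            return logOnly; // in the log-only phase we record the breach but never block
        }
        return true;
    }
}
```

Flipping `logOnly` to false is the only change needed to move from observing to enforcing.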

Phased rollouts are a lifesaver. Release features to a small set of users, keenly observe your application, and gradually ramp up to 100%; this minimizes the impact of bugs and lets you catch them early.
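One common way to implement a phased rollout is to deterministically bucket users by a hash of their id; this is a sketch under that assumption, with illustrative names:

```java
// Sketch of a percentage-based rollout: hash the user id into one of 100
// buckets so a given user stays on the same side of the ramp as it grows.
class PhasedRollout {
    static int bucket(String userId) {
        return Math.floorMod(userId.hashCode(), 100); // stable bucket in [0, 100)
    }

    static boolean enabled(String userId, int rolloutPercent) {
        return bucket(userId) < rolloutPercent;
    }
}
```

Ramping up is then just moving `rolloutPercent` from 1 to 5 to 25 to 100 in configuration, with observation in between.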

Another lifesaver is feature flags. The idea behind a feature flag is simple: you put all your features behind an on-off flag that you can toggle if and when needed. The ability to turn off features during their early lives gives you a lot of flexibility and room to maneuver and improvise.
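A bare-bones sketch of such a flag registry (in a real setup the flags would live in a config store or a tool like LaunchDarkly, so they can be flipped without a deploy):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal in-memory feature-flag registry; unknown flags default to off.
class FeatureFlags {
    private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

    void set(String feature, boolean on) {
        flags.put(feature, on); // toggled at runtime, no deploy needed
    }

    boolean isOn(String feature) {
        return flags.getOrDefault(feature, false);
    }
}
```

Defaulting unknown flags to off means a brand-new feature stays dark until explicitly enabled.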

If you are refactoring or replacing an infrastructure component, never do a stop-the-world cutover. Always run both systems in parallel, and do the switch-over only once you are confident of the new approach. Say, for example, you are moving your data source from MySQL to Mongo. Start by writing to both datastores in parallel. Then change your application to work with Mongo while still keeping the option to fall back to MySQL if needed, and put this behind a flag you can toggle. Only once you are confident that everything works with Mongo, pull the plug on MySQL.
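The dual-write phase might be sketched like this, with a hypothetical `UserStore` abstraction and in-memory stores standing in for MySQL and Mongo:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage abstraction; in-memory maps stand in for MySQL/Mongo.
interface UserStore {
    void save(int id, String name);
    String load(int id);
}

class InMemoryStore implements UserStore {
    private final Map<Integer, String> data = new HashMap<>();
    public void save(int id, String name) { data.put(id, name); }
    public String load(int id) { return data.get(id); }
}

// Dual-write phase: writes go to both stores, a toggle decides which one
// serves reads, so the old store stays a safe fallback until the cutover.
class MigratingUserStore implements UserStore {
    private final UserStore mysql;
    private final UserStore mongo;
    private volatile boolean readFromMongo = false;

    MigratingUserStore(UserStore mysql, UserStore mongo) {
        this.mysql = mysql;
        this.mongo = mongo;
    }

    void setReadFromMongo(boolean readFromMongo) {
        this.readFromMongo = readFromMongo;
    }

    public void save(int id, String name) {
        mysql.save(id, name); // keep writing to the old store
        mongo.save(id, name); // and to the new one
    }

    public String load(int id) {
        return readFromMongo ? mongo.load(id) : mysql.load(id);
    }
}
```

Pulling the plug on MySQL then amounts to deleting the old write path once the toggle has sat safely on Mongo for long enough.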

The common thread holding all these strategies together is planning your deployments so that they are small and gradual; if something goes wrong, it is not catastrophic. You increase the scope gradually as you observe and gain confidence.

As they say, risk is what’s left over when you think you’ve thought of everything. A bit of planning goes a long way in reducing risk and giving your users an excellent experience.

Now You See Me

In the modern software world, where microservices are de rigueur, observability of systems is paramount. If you do not have a way to observe your application, you are as good as dead.

The first step towards embracing observability is figuring out what to track. Broadly, we can categorize software observability into:
1. Infrastructure metrics.
2. Application metrics.
3. Business metrics.
4. Logging.
5. Distributed tracing.
6. Alerting.

Infrastructure metrics:
Infrastructure metrics boil down to capturing the pulse of the underlying infrastructure on which the application runs. Some examples are CPU utilization, memory usage, disk space usage, and network ingress and egress. Infrastructure metrics should give a clear picture of how well the application is utilizing the hardware it runs on. They also aid in capacity planning and scaling.

Application metrics:
Application metrics help in gauging the efficiency of the application: how fast or slow it responds, and where the bottlenecks are. Some examples of application metrics are API response times, the number of times a particular API is called, the processing time of a specific segment of code, and calls to external services along with their latency. Application metrics help in weeding out potential bottlenecks as well as in optimizing the application.
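As an illustration, application metrics often start as a thin wrapper that times a call and bumps a counter. This sketch keeps the numbers in memory, whereas a real setup would ship them to StatsD or Prometheus:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Records how often a code path runs and how long each run takes; a real
// setup would emit these to StatsD/Prometheus instead of keeping a list.
class Metrics {
    final List<Long> latenciesNanos = new ArrayList<>();
    int callCount = 0;

    <T> T timed(Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            latenciesNanos.add(System.nanoTime() - start);
            callCount++;
        }
    }
}
```

Wrapping an external-service call in `timed(...)` yields both the call count and the latency distribution for that segment of code.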

Infrastructure metrics give the overall picture, whereas application metrics help in drilling down to the specifics. For example, if the infrastructure metrics show CPU utilization pegged at 100%, application metrics help in zeroing in on the cause.

Business metrics:
Business metrics are the numbers that are crucial from a functionality point of view. For example, if the piece of code deals with user login and sign-up, some business metrics of interest would be the number of people who sign up, log in, and log out, and the modes of login, such as social login versus direct. Business metrics help in keeping a pulse on the functionality and in diagnosing feature-specific breakdowns.

Business metrics should not be confused with business reports. Business metrics serve a very different purpose; they are not there to quantify numbers accurately but to gauge trends and detect anomalous behavior.

It helps to think of infrastructure, application, and business metrics as a hierarchy in which you zoom from one level to the next, both when keeping tabs on the health of the system and when diagnosing problems. Keeping a check on all three ensures you have a hale and hearty application.

Logging:
Logging enables you to pinpoint specific errors. The big challenge with logs is making them easily accessible to everyone in the organization. Business metrics help in tracking the overall trend, while logging helps to zero in on the specific details.

Distributed Tracing:
Distributed tracing ties together all the microservices in the ecosystem and helps trace a flow end to end as it moves from one microservice to another. Microservices fail all the time; if distributed tracing is not in place, diagnosing issues that span microservices feels like searching for a needle in a haystack.

Alerts:
If you have infrastructure, application, and business metrics in place, you can create alerts that trigger when the metrics show abnormal behavior; this pre-empts potential downtime and business loss. One golden rule for alerts: if it is an alert, it should be actionable. If not, alerts lose their significance and meaning.

Both commercial and open-source software are available for building observability. New Relic is one of the primary contenders on the commercial side; StatsD, Prometheus, and their ilk dominate the open-source spectrum. For log management, Splunk is the clear leader in the commercial space, while the ELK stack takes the crown on the open-source front. Zipkin is an open-source reference implementation of distributed tracing. Most metrics-tracking software has alerting capabilities these days.

If you already have microservices or are moving towards that paradigm, you should be investing heavily in observability. Microservices without observability is a fool’s errand.

Poor Man’s Anomaly Detection

You have a feature where, if someone signs up on your product, you create a wallet for that person and top it up with complimentary money. Your organization swears by microservices; hence the sign-up logic is in one service, while wallet creation and crediting are in another. Once a user signs up, the sign-up service sends a message to the wallet service so that it can create the wallet and do the credit. To ensure the sanctity of the system, you have to make sure that the numbers of sign-ups, wallets created, and credits done match one another. Also, if these go out of sync, alerts need to be in place so you can take corrective action.

Since the two are disparate distributed systems, one way to achieve the above is to use an anomaly detector. There are off-the-shelf products for this, as well as open-source projects. If you do not have the time, need, or resources to invest in deploying an anomaly detection system, a reconciliation system is the way to go.

Reconciliation is deeply entrenched in the financial domain, where it is a way of life. The technology world can borrow it and use it as a poor man’s anomaly detector. For the scenario we started with, we run queries on the data repositories of the sign-up and wallet systems at regular intervals. These queries fetch the counts of sign-ups, wallet creations, and credits that occurred during the period. Once we have the numbers, all we have to do is ensure that they match. One can do this with a simple bash script; it is extremely simple to develop and deploy.
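The comparison at the heart of such a job is a few lines of code (sketched here in Java for consistency with the other examples; the per-window counts are assumed to have already been queried out of the two data stores):

```java
import java.util.Map;
import java.util.TreeMap;

// Compares per-window counts pulled from the two systems and reports the
// windows where they diverge (positive diff = sign-ups without wallets).
class Reconciler {
    static Map<String, Long> mismatchedWindows(Map<String, Long> signUps,
                                               Map<String, Long> wallets) {
        Map<String, Long> diffs = new TreeMap<>();
        for (Map.Entry<String, Long> entry : signUps.entrySet()) {
            long walletCount = wallets.getOrDefault(entry.getKey(), 0L);
            if (walletCount != entry.getValue()) {
                diffs.put(entry.getKey(), entry.getValue() - walletCount);
            }
        }
        return diffs;
    }
}
```

A non-empty result is exactly the actionable alert the previous chapter argued for: each mismatched window points at a concrete batch of records to repair.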

Reconciliation can play a role wherever two-phase commit flows are involved. For example, most payment flows follow a two-phase commit process: you first deduct money from the user’s payment instrument and then fulfill the commitment. There is a good chance that, after the payment debit, your system dies without doing the fulfillment. Having a reconciliation system in place helps you take corrective action in these scenarios.

Reconciliation is a simple way to achieve anomaly detection until you have the need and the resources to invest in a more robust distributed anomaly detector.

Ode To Queues

If you have a producer with an uneven rate of production and a consumer which cannot keep pace with the producer at its peak, use a queue.

If you have a workload which need not be addressed synchronously, use a queue.

If your customer-facing application is riddled with workloads that can be deferred, move these to a queue, thus keeping the customer-facing application lean and mean.

Think of a queue as a shock absorber.

There are workloads that need to be processed immediately with sub-millisecond latency, and then there are ones where you have the luxury of taking your time. It is advisable not to mix the two in one application. The second kind of workload can be moved to a queue and processed by a consumer at its own pace.

For example, consider a scenario where you are consuming messages and persisting them in a data store. The messages come in at a variable rate, and at its peak, the data store cannot handle the load. You have two options: scale the data store to meet the peak load, or slap a queue in between to absorb the shock. The queue solves this problem in a KISS manner.
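The shock-absorber pattern in miniature, using an in-process `BlockingQueue` purely for illustration (in production the queue would be SQS, RabbitMQ, or Kafka, and the consumer a separate process):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;

// A bursty producer fills the queue; the consumer drains it in batches at
// whatever pace the data store can sustain. Single-threaded for clarity.
class ShockAbsorber {
    // Drains up to batchSize messages and "persists" them; returns how many
    // were written this round. Leftover messages simply wait in the queue.
    static int drain(BlockingQueue<String> queue, int batchSize, List<String> dataStore) {
        int written = 0;
        String message;
        while (written < batchSize && (message = queue.poll()) != null) {
            dataStore.add(message); // the real write, at the store's pace
            written++;
        }
        return written;
    }
}
```

The burst never hits the data store directly; the queue holds the overflow until the next drain cycle.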

Queues enable applications to be highly available while giving you enough room to maneuver. As long as the queue is highly available, the chance of message loss is almost nil. And since a queue is durable, you need not perfect your consumer’s high availability; you get leeway to manage failures.

With applications embracing the microservices paradigm, there is a lot of API back and forth. Not all API consumption has to be in real time; whatever can be deferred should use a queue as the transport mechanism.

A queue introduces a bit more complexity into an application, but the advantages it brings to the table make it a worthwhile investment.

Anti-features

When evaluating a new technology, framework, or library, a lot of importance is given to the salient features. While it is very important to know the positives, the negatives usually tend to be glossed over. Being aware of the shortcomings of a framework gives you the ability to anticipate problems down the road.

For example, take NoSQL databases. A lot of time is spent singing paeans to the scalability, malleability, and so on of NoSQL databases, while hardly any is spent thinking about the negatives that come with them.

Two simple techniques give good visibility into anti-features:
1. The very obvious one: Google for the shortcomings. Someone will have written a blog post on the interwebs highlighting how a framework or technology let them down. For example, take this post by Uber on how Postgres did not work as expected for them.
2. Comb through GitHub and/or JIRA, peeking at the bugs raised and the enhancements requested.

Both of the above will give a good picture of the shortcomings. If you are evaluating a closed-source proprietary technology, though, they may not be feasible.

Once a mental note is made of the negatives, ponder the scenarios in which these might affect your usage. It helps to spend quality time on this, as it will save you a lot of future trouble.

If you think about it, this might sound very obvious, yet it tends to be highly neglected. We get so caught up in the positives of something that the negatives are ignored, and this usually comes back to bite us later.

Testing legacy applications

When contemplating introducing automated testing into legacy applications, it is easy to get bogged down in terminology: unit testing, integration testing, regression testing, black-box testing, white-box testing, stress testing, etc. Quite a bit of time is spent in debates on unit testing versus integration testing; I have written about this before too.

A practical way to approach testing legacy applications is to first scope out the intention behind the test. Is it to test the behavior of a particular method, an API response, or how the application behaves after an HTTP form submit? The next step is to jot down everything that has to be done to enable this. For example, if a database is involved, it can be mocked, or a test database with bootstrapped data can be used.

The gamut of changes needed to inject testability into an application that has never seen testing should never be underestimated. The way you structure testable code is vividly different from coding without testing in mind.

Take a look at the code below. How would you unit test the getUser method without creating a database connection?

public class Foo {
    DbConnection connection = null;
    public Foo() {
        connection = <establish db connection>;
    }

    public User getUser(int id) {
        //Query db and get user data
        User user = new User();
        //Fill user with data from db
        return user;
    }
}

To mould this into testable code, DbConnection creation needs to be decoupled from object creation, like below:

public class Foo {
    DbConnection dbConnection = null;
    public Foo(DbConnection dbConnection) {
        this.dbConnection = dbConnection;
    }

    public User getUser(int id) {
        //Query db and get user data
        User user = new User();
        //Fill user with data from db
        return user;
    }
}

Since DbConnection creation is now independent of object creation, DbConnection can be mocked to unit test any method in the class. An application written without testing in mind will be replete with code like the first version. Code patterns like these are one of the biggest hurdles in testing legacy applications.
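For illustration, here is a condensed, self-contained take on the refactored class with a hand-rolled stub in place of a real database (this assumes `DbConnection` can be expressed as, or wrapped behind, an interface; `queryUserName` is a hypothetical method):

```java
// Condensed version of the refactored class, with DbConnection expressed as
// an interface so a hand-rolled stub can replace the real database in tests.
interface DbConnection {
    String queryUserName(int id); // hypothetical query method
}

class StubDbConnection implements DbConnection {
    public String queryUserName(int id) {
        return "test-user-" + id; // canned data instead of a real DB round trip
    }
}

class User {
    String name;
}

class Foo {
    private final DbConnection dbConnection;

    Foo(DbConnection dbConnection) {
        this.dbConnection = dbConnection;
    }

    User getUser(int id) {
        User user = new User();
        user.name = dbConnection.queryUserName(id);
        return user;
    }
}
```

A test can now construct `Foo` with the stub and assert on `getUser` without any database infrastructure in place.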

The next step is to eliminate the resistance to testing. This means setting up all the infrastructure and libraries needed to carry out testing, with a reference readily available to follow. Bunch test cases into categories: unit tests, tests that need a mocked object, tests that need a mocked database, tests that need a database seeded with data, tests that need a web server, and so on. After this, implement one test case for each of these categories. This serves a dual purpose: the setup is ready for each category, and a reference is readily available for others to emulate.

One aspect that is usually neglected is the effect of testing on the product release cycle. Testing introduces more code, dependencies, and infrastructure, all of which need to be maintained. Along with working on new features, writing tests for them also has to be taken into account. While refactoring, it is not just the code that has to be refactored; the test cases have to be refactored too. This is a tradeoff between time to market on one side, and maintainability and reliability on the other.

Testing is no longer the chore it used to be; testing tools and frameworks have grown by leaps and bounds. With the advent of Docker, headless browsers, Selenium, and the like, testing is very much within reach of most teams, provided the intention is there and the effort is put in.

SQS versus Kinesis

A lot of people are confused about SQS versus Kinesis. In some ways, both act as queues, but there is a massive difference between the two.

SQS is a queue; it promises at-least-once delivery, with best-effort ordering on standard queues and strict FIFO ordering on FIFO queues.

Kinesis is a distributed stream processor. A simplistic and hand-wavy way to think of Kinesis is as one large log file, with the items you write to the stream as lines in this log file. When you want to process the stream, you get a pointer into the log file, and every time you read a line, the pointer moves to the next one. Kinesis is stateless in the sense that it does not maintain the pointer for you; it is up to your reading process to maintain it. What this means is that if you are reading off a Kinesis stream and your process goes down, when you bring the reader up again, it will start processing from the beginning, not from the last line before the crash.

There is no concept of popping items out of Kinesis; the data stays in the stream until its retention period expires, and you only manipulate your pointer into it. Hence, if you want to reprocess the stream, you can replay it, i.e., start from the beginning and do the whole thing over again. AWS provides a client library for Kinesis (the Kinesis Client Library) which maintains this state for you, persisting it in DynamoDB.
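The log-file analogy can be captured in a toy model: the stream only appends, and each reader owns its own offset, so resetting the offset is a replay (this is purely conceptual, not the Kinesis API):

```java
import java.util.ArrayList;
import java.util.List;

// Append-only "log file": records are never popped, only appended.
class EventLog {
    private final List<String> log = new ArrayList<>();

    void append(String record) { log.add(record); }
    String read(int offset) { return log.get(offset); }
    int size() { return log.size(); }
}

// The reader, not the log, owns the pointer. Losing it means starting
// from the beginning; resetting it deliberately is a replay.
class Reader {
    private int offset = 0;

    String next(EventLog log) { return log.read(offset++); }
    void replayFromStart() { offset = 0; }
}
```

Note how reading never shrinks the log; two independent readers with their own offsets would each see the full stream.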

This should give you a fair idea of when to use Kinesis and when to opt for SQS.