
Make It Small

“Make it large” is the mantra these days, but when it comes to releasing software, think the opposite: make it small. The secret to a successful release is to break it into numerous small deployments. This serves a dual purpose: it minimizes risk, and it gives you enough leeway to fix bugs before they negatively impact end users.

Never wait for a feature to be complete before deploying it. Break it into multiple smaller chunks and keep deploying these to production even though no one will be using them yet. Many small deployments always beat one big-bang deployment. The longer the code runs in production, the more data points you have to analyze and act on.


If you can test with real users without them knowing, there is nothing like it. If you put your mind to it, you will always figure out a way to do this. For example, let us say you are building a chat feature for your application. You can stealthily release it without informing your users and send dummy messages on their behalf. The users do not see any chat-related functionality, but behind the scenes, you create dummy messages and send them from one user to another. Reportedly, this is how Facebook tested their chat before launch; it lets you exercise the feature realistically, in the ways it will ultimately be used.

Another trick is to log the action without actually carrying out the intended side effect. Say, for example, you are introducing an API rate limiter. As a first step, start logging whenever an API call hits the rate limit threshold. During this phase, do not block the calls; the idea is to emulate the result. This lets you test the feature with real traffic and fix any bugs before it starts negatively affecting your API consumers. Once you are confident of your approach, you can roll out the version that actually blocks offending callers.
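
To make the idea concrete, here is a minimal sketch of the log-only phase in Python; the sliding-window check, the `ENFORCE_LIMITS` toggle, and the handler are illustrative names, not any particular framework's API:

```python
import logging
import time
from collections import defaultdict

logger = logging.getLogger("rate_limiter")

WINDOW_SECONDS = 60
MAX_CALLS_PER_WINDOW = 100
ENFORCE_LIMITS = False  # shadow mode: only log, never block

_recent_calls = defaultdict(list)  # api_key -> timestamps of recent calls


def is_over_limit(api_key):
    """Sliding-window check: has this key exceeded the threshold?"""
    now = time.time()
    recent = [t for t in _recent_calls[api_key] if now - t < WINDOW_SECONDS]
    recent.append(now)
    _recent_calls[api_key] = recent
    return len(recent) > MAX_CALLS_PER_WINDOW


def process(request):
    """Placeholder for the real request handling."""
    return {"status": 200, "body": "ok"}


def handle_api_call(api_key, request):
    if is_over_limit(api_key):
        # Phase 1: only record the violation so we can study real traffic.
        logger.warning("rate limit exceeded for %s", api_key)
        if ENFORCE_LIMITS:
            # Phase 2: flip the toggle once the logs look sane.
            return {"status": 429, "body": "Too Many Requests"}
    return process(request)
```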

Phased rollouts are a lifesaver. Release features to a small set of users, observe your application keenly, and gradually ramp up to 100%; this minimizes the impact of bugs and lets you catch them early.
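
One common way to implement such a ramp-up is deterministic bucketing on the user id, so the same user consistently stays in or out of the rollout as you raise the percentage. A rough sketch, with the percentage and the function names purely illustrative:

```python
import hashlib

ROLLOUT_PERCENT = 5  # start small, ramp up towards 100 as confidence grows


def in_rollout(user_id, percent=ROLLOUT_PERCENT):
    """Deterministically place a user in a bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def render_checkout(user_id):
    if in_rollout(user_id):
        return render_new_checkout(user_id)  # the feature being ramped up
    return render_old_checkout(user_id)      # the existing behaviour


def render_new_checkout(user_id):  # illustrative stub
    return "new checkout for " + user_id


def render_old_checkout(user_id):  # illustrative stub
    return "old checkout for " + user_id
```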

Another lifesaver is feature flags. The idea behind a feature flag is simple: you put each feature behind an on-off flag which you can toggle if and when needed. The ability to turn off features early in their lives gives you a lot of flexibility and room to maneuver and improvise.
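
At its simplest, a feature flag is just a lookup the code consults before taking a path. The in-memory dict below stands in for what would normally be a config service or database you can flip without a deploy; the flag names are made up:

```python
# In-memory flag store for illustration; real systems read flags from a
# config service or database so they can be flipped without a deploy.
FLAGS = {
    "new_search": True,
    "chat": False,  # built and deployed, but switched off for now
}


def is_enabled(flag_name):
    return FLAGS.get(flag_name, False)  # unknown flags default to off


def search(query):
    if is_enabled("new_search"):
        return "new results for " + query  # new code path
    return "old results for " + query      # old, battle-tested path
```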

If you are refactoring or replacing an infrastructure component, never do a stop-the-world cutover. Always run both systems in parallel, and only once you are confident of the new approach should you do the switch-over. Say, for example, you are moving your data source from MySQL to Mongo. Start by writing to both datastores in parallel. Then change your application to work with Mongo while still keeping the option to fall back to MySQL if needed; put this behind a flag which you can toggle. Only once you are confident that everything works with Mongo should you pull the plug on MySQL.
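
A sketch of the parallel-run idea; `mysql_store`, `mongo_store`, and the `READ_FROM_MONGO` toggle are placeholders standing in for your real data-access layers and flag system:

```python
import logging

READ_FROM_MONGO = False  # flip only once you trust the new datastore


def save_order(order, mysql_store, mongo_store):
    # Dual write: MySQL stays the source of truth while Mongo warms up.
    mysql_store.save(order)
    try:
        mongo_store.save(order)
    except Exception:
        # A Mongo failure must not break the user; log it and move on.
        logging.exception("mongo dual-write failed for order %s", order["id"])


def load_order(order_id, mysql_store, mongo_store):
    if READ_FROM_MONGO:
        try:
            return mongo_store.load(order_id)
        except Exception:
            logging.exception("mongo read failed, falling back to mysql")
    return mysql_store.load(order_id)
```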

The common thread that holds all these strategies together is to plan your deployments such that they are small and gradual; if something goes wrong, it is not catastrophic. You increase the scope gradually as you observe and gain confidence.

As they say – Risk is what’s left over when you think you’ve thought of everything. A bit of planning goes a long way in reducing risk as well as giving an excellent experience to your users.

Photo by Magda Ehlers from Pexels

Knowing Versus Doing

Over-engineering is rife in the software industry; it mainly manifests in three ways:
1. Needlessly complicated design.
2. Lift-and-shift engineering – picking up technology and practices from other places without understanding the context in which they were developed.
3. Being trendy – using frameworks and languages that are currently in fashion irrespective of whether one needs them or not.

I have written about this before.

Please do not take this prose as an argument for being sloppy and building crappy products.


I firmly believe one of the reasons this occurs is that people confuse knowing with doing. Being aware of something should not force you into using or implementing it. You might have a personal inclination towards a language or a framework; you might even believe it is the best in the world, but that does not mean you should rewrite your currently well-working production application in said technology. Coming across something new and shiny does not mean it has to become part of your application stack.

When we see something novel and trendy, our brain actively tries to figure out ways in which we can make it a part of our lives.

Another arena where this plays out is doing things only if and when they are needed. Reading a blog post touting Redis as the best thing since sliced bread should not send you off slapping Redis as a cache in front of your data stores; your data might be so tiny that it fits entirely into main memory. Achieving five nines of reliability is a daunting task which takes a lot of engineering effort. As a geek, it is a fascinating problem to work on, but your application may not need it. Just because you are aware of what others are doing to achieve this does not mean you too should embark on that path.

If you ponder this, you will realize this behavior is not restricted to work; it plays out in our personal lives too. When we come across something new and exciting in any domain of life, we try to adopt it irrespective of whether we truly need it, and then we go out of our way to justify its usefulness to ourselves.

Open Source and Revenue

This is the second part in a series on open source software. In the first part, we examined why equating open source with “just” free is a fool's errand. In this post, we will explore the different avenues for generating revenue from open source software.



The first avenue is pretty straightforward – charge for support, maintenance, consulting, and custom development. Software takes effort to understand and maintain; you can either do it in-house or outsource it to an external firm. Big enterprises have specific needs which require custom enhancements, and they also need consistent support and advice. Plenty of companies, such as Red Hat and Percona, have used this model to generate revenue from open source software.

The SaaS option for open source software has gained immense traction in the last decade or so, especially since the advent of the cloud. Instead of you taking the pain to host and maintain the software, the company behind it deploys and manages it for a recurring fee. Most popular open source software is available under this model nowadays; WordPress, MongoDB, and Elasticsearch are some prime examples of this strategy.

Another revenue strategy is the open core model. The core is open and free, but features which enterprises need, like security, high availability, and user management, are part of the commercial offering. For example, the core database could be open, but clustering and high availability might be available only in the paid version. InfluxDB uses this model.

Then there is the licensing play. Software licensing is nuanced and comes with a lot of baggage and restrictions. The open source version of the software is released under a restrictive, commercially unfriendly license. If you want to use the software in a business setting, you have the option of buying a more permissive, commercial-friendly license; this is very prevalent in software bundled into commercial products.

It is not uncommon for a company to use a mixture of the above strategies.

In the next part of the series, we will go through some recent developments in the open source world in an attempt to ward off the threat from big cloud providers like AWS.

Image Credit: Lucas Favre

Open Source != Free

This is the first post in a series on open source software. You can read the second post here.

One of the most common conflations I see people make is mistaking open source software for free software; the two are not the same. Being free is just the icing on the cake; the more significant advantage is the freedom and flexibility that come with open source software.

Let us say you are an enterprise that has built its entire multi-million-dollar product on a closed source database. Your business is now profoundly entwined with the success of the database company. What happens if the database company goes kaput and shuts down? You now have to replace the database. Depending on the complexity of the product and the business, this might take significant effort and could end up derailing your entire business. Open source software greatly mitigates this problem.


There is no concept of shutting down in the open source world. Open source software development is inherently decentralized. In a lot of cases, committees govern open source software development, and these committees have many stakeholders whose best interest lies in keeping the software alive. Apart from this, many boutique firms provide development and maintenance services. All this leads to a robust ecosystem that prevents a project from abruptly shutting down and taking you hostage.

Commercial closed source software reminds me of a famous line from the Eagles song Hotel California – you can check out any time you like, but you can never leave. Once you are locked into a piece of software, it is not easy to get out. During pricing discussions, there is an asymmetry: as a locked-in customer, it is difficult for you to leave; there is no BATNA (best alternative to a negotiated agreement). Open source software does not have this problem.

Having access to the source code is a huge advantage. When I was building Kwery, I used Apache Derby as the database. I started seeing a weird bug in Kwery, which led me to tinker with the Apache Derby source and finally unearth a bug in the Derby database itself. A couple of mail exchanges on the Derby mailing list confirmed this. If I did not have access to the source code, there would have been no way for me to figure this out.

I am not saying that open source software is a panacea for all problems, or that you should completely shun commercial closed source software. Each has its place, but equating open source software with merely being free is folly.

You can read the second post here.

Image credit: Photo by Adailton Batista from Pexels

Now You See Me

In the modern software world, where microservices are de rigueur, observability of systems is paramount. If you do not have a way to observe your application, you are as good as dead.

Image Credit: W A T A R I on Unsplash

The first step towards embracing observability is figuring out what to track. Broadly, we can categorize software observability into:
1. Infrastructure metrics.
2. Application metrics.
3. Business metrics.
4. Distributed tracing.
5. Logging.
6. Alerting.

Infrastructure metrics:
Infrastructure metrics boil down to capturing the pulse of the underlying infrastructure on which the application is running. Some examples are CPU utilization, memory usage, disk space usage, and network ingress and egress. Infrastructure metrics should give a clear picture of how well the application is utilizing the hardware it runs on. They also aid in capacity planning and scaling.
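
As an illustration, a cross-platform library such as psutil can read most of these numbers from the host; in practice you would ship them to a metrics backend every few seconds rather than print them. A minimal sketch, assuming psutil is installed:

```python
import psutil


def collect_infrastructure_metrics():
    """Sample the host's vital signs; in practice these would be shipped
    to a metrics backend rather than printed."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "net_bytes_sent": psutil.net_io_counters().bytes_sent,
        "net_bytes_recv": psutil.net_io_counters().bytes_recv,
    }


if __name__ == "__main__":
    print(collect_infrastructure_metrics())
```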

Application metrics:
Application metrics help in gauging the efficiency of the application: how fast or slow it responds and where the bottlenecks are. Some examples of application metrics are API response times, the number of times a particular API is called, the processing time of a specific segment of code, and calls to external services and their latency. Application metrics help in weeding out potential bottlenecks as well as in optimizing the application.
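
A common way to capture these is a StatsD-style client wrapped around the interesting sections of code. A rough sketch using the Python `statsd` package; the host, port, metric names, and the stubbed-out functions are assumptions for illustration:

```python
from statsd import StatsClient

metrics = StatsClient("localhost", 8125, prefix="myapp")


def fetch_user_from_db(user_id):  # illustrative stub
    return {"id": user_id}


def fetch_avatar(user_id):  # illustrative stub
    return "https://example.com/avatars/%s.png" % user_id


def get_user(user_id):
    metrics.incr("api.get_user.calls")            # how often is this API hit?
    with metrics.timer("api.get_user.latency"):   # how long does it take?
        user = fetch_user_from_db(user_id)
        with metrics.timer("ext.avatar_service.latency"):  # external call latency
            user["avatar"] = fetch_avatar(user_id)
    return user
```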

Infrastructure metrics give an overall picture, whereas application metrics help in drilling down to the specifics. For example, if infrastructure metrics indicate abnormally high CPU utilization, application metrics help in zeroing in on the cause.

Business metrics:
Business metrics are the numbers which are crucial from a functionality point of view. For example, if a piece of code deals with user login and sign-up, some business metrics of interest would be the number of people who sign up, log in, and log out, and the modes of login, like social login versus direct. Business metrics help in keeping a pulse on functionality and in diagnosing feature-specific breakdowns.

Business metrics should not be confused with business reports. Business metrics serve a very different purpose; they are not there to quantify numbers accurately but to gauge trends and detect anomalous behavior.

It helps to think of infrastructure, application, and business metrics as a hierarchy in which you zoom from one level to the next, both when keeping a tab on the health of the system and when diagnosing problems. Keeping a check on all three ensures you have a hale and hearty application.

Logging:
Logging enables you to pinpoint specific errors. The big challenge with logs is making them easily accessible to everyone in the organization. Business metrics help in tracking the overall trend, and logging helps to zero in on the specific details.

Distributed Tracing:
Distributed tracing ties together all the microservices in the ecosystem and helps trace a flow end to end as it moves from one microservice to another. Microservices fail all the time; if distributed tracing is not in place, diagnosing issues which span microservices feels like searching for a needle in a haystack.
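
Stripped of any particular tracing system, the core mechanic is that every request carries an id which each service logs and forwards. A bare-bones sketch; the header name, the downstream URL, and the helpers are illustrative (real systems use something like Zipkin's B3 headers):

```python
import logging
import uuid

import requests  # assumed available for the downstream call

TRACE_HEADER = "X-Trace-Id"  # illustrative; Zipkin, for example, uses B3 headers


def handle_request(incoming_headers):
    # Reuse the caller's trace id, or start a new trace at the edge.
    trace_id = incoming_headers.get(TRACE_HEADER, uuid.uuid4().hex)
    logging.info("[trace %s] handling request", trace_id)

    # Forward the same id so the next microservice's logs can be stitched
    # together with ours when diagnosing a failure end to end.
    response = requests.get(
        "http://inventory-service/items",  # hypothetical downstream service
        headers={TRACE_HEADER: trace_id},
        timeout=2,
    )
    return response.status_code, trace_id
```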

Alerts:
If you have infrastructure, application, and business metrics in place, you can create alerts which trigger when those metrics show abnormal behavior; this pre-empts potential downtime and business loss. One golden rule for alerts: if it is an alert, it should be actionable. If not, alerts lose their significance and meaning.

Both commercial and open source software are available for building observability. New Relic is one of the primary contenders on the commercial side, while StatsD, Prometheus, and their ilk dominate the open source spectrum. For log management, Splunk is the clear leader in the commercial space, and the ELK stack takes the crown on the open source front. Zipkin is an open source reference implementation of distributed tracing. Most metrics-tracking software has alerting capabilities these days.

If you already have microservices or are moving towards that paradigm, you should be investing heavily in observability. Microservices without observability is a fool's errand.

Sherlock Versus Calvin Ball

We can classify software development into:
1. Maintaining and enhancing existing software.
2. Software development from scratch.

Given a choice between the two, developers usually gravitate towards from-scratch development. Developing something from scratch is intensive creative work where you have the freedom to shape the product the way you see fit, so it is pretty obvious why people prefer it. I draw a parallel here with Calvin Ball. For those of you not familiar with it, Calvin Ball is a game Calvin invented in which he makes up the rules on the fly as the game goes along. From-scratch development is akin to Calvin Ball: you can create and amend the rules as you go. If you choose a framework and, in the course of development, find that it does not fit the bill, you have the freedom to swap it for something else. You are operating with a lot of degrees of freedom.


Maintaining and enhancing existing software is more like solving a puzzle or playing a game with well-laid-out rules. Someone has already laid the foundation or, in a lot of cases, built the entire structure. You first have to expend time and effort in grokking this and familiarizing yourself with what is already there; only then will you be able to do something. A lot of the time you need to get into the mind of the original developer and decipher things from her perspective. Working on code written by others is more like Sherlock Holmes's work. When you make changes and enhancements, you have to ensure what you are doing fits well into the existing framework. You are working in a constrained environment; you have to stick to the rules of the game. All this is as challenging as, and sometimes more challenging than, developing software from scratch.


Debugging is an acquired skill which carries over to all areas of development. When you troubleshoot code written by others, you become more attuned to adding enough debugging information to the code you write. You start empathizing with the person who will maintain your system in the future and ensuring that person has enough data points to debug when things go wrong. That future person might well be you. Injecting debugging information and future-proofing your project is a fundamental behavioral change that maintenance work induces in you.

There is nothing wrong with preferring to create something from scratch, but it is imperative to have the maintenance skill set under your belt as well. The real world requires far more maintenance and enhancement work than from-scratch development. If from-scratch development is all you have done till now, it is high time you challenge yourself with maintenance work. You will feel a bit frustrated and handcuffed in the beginning, but the way to approach it is like solving a mystery. If you see it that way, it becomes a fun and entertaining experience.

PS: Calvin and Hobbes image taken from Wikipedia.

Concurrency Models

We can roughly classify concurrency models into:
1. Thread based concurrency.
2. Event based concurrency.

Imagine that you run a store with only one customer service representative. As soon as a customer walks in, the representative greets them with a quick hello, saying, “If you need any help, give me a shout, and I will help you out.” She then waits for the customer to seek help. She aims to complete each interaction as quickly as possible and go back to waiting for the next one. When a customer asks for help, she quickly answers the query and goes back to waiting. If a customer asks where the washroom is, she quickly points in the right direction and reverts to waiting. If a customer asks her for the price of a product, she quickly conveys the price and goes back to waiting. The point to note here is that there is only one customer service representative for the entire store, servicing all customers. This model works exceptionally well when the representative is fast and the answers to the queries are quick. Concurrency based on events works like this.

Now consider the situation where you have five customer service representatives in your store. As soon as a customer walks in, a representative is assigned exclusively to that customer. When another customer walks in, one more representative is picked from the pool and assigned to the customer. The critical point to note here is that there is a one to one relationship between the customer service representative and the customer. When one representative is servicing a customer, she does not bother about other customers; she is exclusive to that customer. Since our pool has five representatives, at most, we can serve only five customers at a time. What do we do when the sixth customer walks into the store? We can wait until one of the customers walks out or we can have a rule saying that a representative services a customer for a fixed period after which she will be assigned to another waiting customer. She is reassigned to the original customer once the time elapses. Concurrency based on threads works like this.

Coming back to the scenario where the sixth customer walks in: we either have to ask the sixth customer to wait until a representative is free, or we have to wean a representative away from one of the existing customers and assign her to the new customer. When this happens, the customer who was being serviced by that representative has to wait, and once the allotted time elapses, we have to assign the representative back to the original customer. When a lot of customers walk in and you have a fixed number of representatives, quite a bit of coordination is needed to service all customers satisfactorily. In a computer, the CPU scheduler takes care of this switching between tasks. Switching is a comparatively time-consuming operation and an overhead of the thread based concurrency model when compared to an event based one.

In the single-representative scenario, what happens if one of the customers starts a long conversation with the representative? The representative is stuck with that customer, and if other customers have queries, they have to wait for the representative to finish the ongoing conversation. Also, what if one of the customers sends the representative on a long-running errand, like fetching something from a depot a mile away? Until the representative returns, all other customers have to wait to get their queries resolved. One errant customer can jeopardize all the other customers and hold up the entire store's operation.

Hence, when working with event based concurrency, it is essential not to:
1. Carry out CPU intensive tasks akin to having a long-running conversation with the representative.
2. Carry out blocking IO tasks similar to sending the representative to the depot.


NGINX and Redis are probably the most commonly used software that leverage event based concurrency. The workloads that these cater to are quick. Hence event based concurrency makes perfect sense here.

Take the case of NGINX used as a reverse proxy: what does it do? Pick a client connection from the listen queue, do some work on it, forward it to the upstream server, and then wait for the upstream to respond. While waiting for the upstream, NGINX can pick more client connections from the queue and repeat the above. When the upstream sends a response, it relays this back to the client. Since all of these are short-lived operations, the work fits beautifully into an event based concurrency model. Good old Apache HTTP Server creates a thread/process for each connection to do the same, so Apache is constrained by the number of threads it has. If the number of incoming requests is more than the number of threads in its pool, it has to deal with switching and coordination. NGINX does not have this overhead, which makes it comparatively faster than Apache in real-world workloads. All of this is a bit simplistic and hand-wavy but should convey the idea.

Event based concurrency cannot, by itself, leverage the multiple CPU cores which all modern processors have. To do this, you create one event loop per core, usually called a worker. Also, most software that leverages event based concurrency adopts a hybrid model: event based concurrency for short-lived, quick operations, with long-running tasks off-loaded to a thread or process.
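
A toy asyncio version of the store analogy: one "representative" (the event loop) answers many quick queries, and the long-running, blocking errand is pushed onto a worker thread so it cannot hold up everyone else, mirroring the hybrid model described above. All names are illustrative:

```python
import asyncio
import time


def fetch_from_depot(item):
    """A blocking, long-running errand; never run this on the event loop."""
    time.sleep(2)
    return item + " fetched from depot"


async def quick_query(customer, question):
    await asyncio.sleep(0.01)  # a fast, non-blocking answer
    return "answered %s: %s" % (customer, question)


async def main():
    loop = asyncio.get_running_loop()
    results = await asyncio.gather(
        quick_query("alice", "price of apples?"),
        quick_query("bob", "where is the washroom?"),
        # Off-load the blocking errand to a worker thread (the hybrid model).
        loop.run_in_executor(None, fetch_from_depot, "spare bulbs"),
        quick_query("carol", "opening hours?"),
    )
    for result in results:
        print(result)


asyncio.run(main())
```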

I have glossed over a lot of details and nuances to explain a complex topic like concurrency in simple terms. Treat this as a good starting guide to dig more into this fascinating world.

Ode To Queues

If you have a producer with an uneven rate of production and a consumer which cannot keep pace with the producer at its peak, use a queue.

If you have a workload which need not be addressed synchronously, use a queue.

If your customer-facing application is riddled with workloads which can be deferred, move these to a queue thus making the customer-facing application lean and mean.


Think of a queue as a shock absorber.

There are workloads which need to be processed immediately with sub-millisecond latency, and then there are ones where you have the luxury of taking your time. It is advisable not to mix the two in one application. The second kind of workload can be moved to a queue and processed by a consumer.

For example, consider a scenario where you are consuming messages and persisting them in a data store. The messages come in at a variable rate, and at its peak the data store cannot handle the load. You have two options: scale the data store to meet the peak load, or slap a queue in between to absorb the shock. A queue solves this problem in a KISS manner.
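
An in-process illustration using Python's standard `queue` module; in production the queue would typically be an external broker (RabbitMQ, SQS, Kafka, and the like), but the shape is the same: a bursty producer on one side and a steady consumer on the other:

```python
import queue
import threading
import time

messages = queue.Queue()  # the shock absorber between producer and consumer


def producer():
    # Bursty producer: a flood of messages, then a lull, then another flood.
    for burst in range(3):
        for i in range(100):
            messages.put("message-%d-%d" % (burst, i))
        time.sleep(1)
    messages.put(None)  # sentinel: nothing more to come


def consumer():
    # Steady consumer: drains the queue at the rate the data store can handle.
    while True:
        msg = messages.get()
        if msg is None:
            break
        time.sleep(0.005)  # stand-in for writing to the data store


threading.Thread(target=producer).start()
consumer()
```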

Queues enable applications to be highly available while giving you enough room to maneuver. As long as the queue is highly available, the chance of message loss is almost nil. And since the queue is durable, you do not need to perfect your consumer's high availability; you get leeway to manage it.

With applications embracing the microservices paradigm, there is a lot of API back and forth. Not all API consumption has to happen in real time; whatever can be deferred should use a queue as the transport mechanism.

A queue introduces a bit more complexity into an application, but the advantages it brings to the table make it a worthwhile investment.

Software Security

Some disparate thoughts on security in no particular order.

Many security bugs can be avoided by making a clear distinction between authentication and authorization. When you log into Facebook, you use a username and password. Facebook lets you in only once it is sure you are the owner of the account, by verifying your password. This is authentication. Once you are logged in, you cannot view all of your friends' photos; you can only view those photos which your friends have authorized you to view. This is authorization. A whole class of security bugs arises because developers do not make this distinction.
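
The distinction, reduced to code, looks something like the sketch below; the in-memory user table and photo audience list are deliberately simplified stand-ins (real systems store hashed passwords, as discussed next):

```python
# Deliberately simplified in-memory data; a real system would use a proper
# user store with hashed passwords and a persisted permission model.
USERS = {"alice": "correct horse battery staple"}
PHOTO_AUDIENCE = {"photo-42": {"alice", "bob"}}  # who may view each photo


def authenticate(username, password):
    """Authentication: is this person who they claim to be?"""
    return USERS.get(username) == password


def authorize(username, photo_id):
    """Authorization: is this (already authenticated) person allowed to do this?"""
    return username in PHOTO_AUDIENCE.get(photo_id, set())


def view_photo(username, password, photo_id):
    if not authenticate(username, password):
        return "401: we do not know who you are"
    if not authorize(username, photo_id):
        return "403: we know who you are, but you may not see this"
    return "...photo bytes for " + photo_id
```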


A lot of security is knowing what not to do. Security by obscurity and hand-rolling security algorithms and protocols are the two things that immediately come to mind. For example, when storing passwords, instead of coming up with an elaborate custom secure storage scheme, employ the industry-standard bcrypt.
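
For instance, with the widely used `bcrypt` package, password storage collapses to a couple of calls; salt generation and the work factor are handled for you:

```python
import bcrypt  # assumed installed: the standard Python bcrypt package


def hash_password(plain_password):
    # gensalt() embeds a random salt and a work factor into the hash itself.
    return bcrypt.hashpw(plain_password.encode("utf-8"), bcrypt.gensalt())


def verify_password(plain_password, stored_hash):
    return bcrypt.checkpw(plain_password.encode("utf-8"), stored_hash)


stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", stored))  # True
print(verify_password("wrong guess", stored))                   # False
```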

There is a school of thought that more access control means better security. One manifestation of this is restricting SSH access to production boxes. Unless you have invested heavily in tooling, this slows teams down drastically, and in today's world, where speed is paramount, it does not work. Under pressure to move fast, teams find ingenious ways to circumvent these controls. Strict access control only works in organizations which are fine with taking things slowly, but it usually stifles productivity and leaves behind a bevy of frustrated developers. The way around this problem is to keep only the most necessary access controls and take care of the rest through tooling. An example is how Netflix uses tooling to let developers SSH into production boxes without compromising security.

Security implemented naively goes against human nature, which seeks to accomplish tasks in the least restrictive way. If you do not invest in tooling, security always gets in the way of getting things done.

A less intrusive way of doing security is to configure systems with sane defaults. For example, when you provision a server, ensure that it is fortified by default. If you are using external tools, configure them with secure defaults; if you are using Slack, for instance, configure it so that only people with your organization's email address can sign up. Also carry out periodic audits of your systems; this could be anything from scanning SSH access logs to auditing repositories to ensure secrets and passwords are not leaked.

No write-up on security is complete without touching upon compliance. There are tons of regimes – PCI, HIPAA, SOX, etc. – and all of them come with their own baggage. One simple way to keep this manageable is to first understand which parts of your application actually have to be under the scope of compliance. For example, if you have an e-commerce application taking credit card information, you have to be PCI compliant; but this does not mean your entire application has to be under the scope of a PCI audit. You can smartly bifurcate the application into parts that deal with payment and parts that do not. Once this is done, only the parts that deal with payment have to be under PCI scope.

A final note: security is a never-ending concern, and there is no such thing as enough security. Where you draw the line is up to you.

Here is a hilarious comic by XKCD on teaching a lesson to people who do not follow security practices.

(xkcd: Exploits of a Mom)

Naming Things

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

Even though the above might have been said in jest, naming variables while writing code is a head-scratching experience. Should I make it short? Should I make it descriptive? If descriptive, how descriptive? These thoughts keep running through one's head.


A simple strategy is to keep the descriptiveness of a variable's name in line with the reach of that variable. If the variable is short-lived, i.e., confined to a small block, stick to a short name, as the cognitive load of keeping track of it is negligible. If the variable's reach is much larger, as in it spans a large number of lines, make the name as descriptive as possible.
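
A small illustration of the idea in Python; the names and the payment example are made up:

```python
# Long reach: read in many places across the module, so spell it out.
maximum_retry_attempts_for_payment_gateway = 5


def positive_total(prices):
    # Short reach: `p` lives for exactly one line; a longer name adds nothing.
    return sum(p for p in prices if p > 0)


def charge(card, amount):
    for _ in range(maximum_retry_attempts_for_payment_gateway):
        if try_charge(card, amount):  # illustrative helper
            return True
    return False


def try_charge(card, amount):  # stub for illustration
    return True
```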

It goes without saying that names should also adhere to the conventions your team has adopted.