Measuring Software Performance

Measuring performance helps us to define exactly what we mean by performance, to quantify it, and ultimately to identify where to expend effort improving it.

What is performance?

When I think about performance, I instinctively think about speed. But what does that actually mean? Obviously software can't move anywhere, so we need to be more scientific about it: how do we actually quantify software performance? It's only when we have a way of measuring performance that we can start to objectively analyse and compare it.

In this article I present the two most common performance metrics.

Elapsed Time

This is probably the easiest performance measurement to understand, since it is something we use frequently in everyday life. When we discuss the performance of cars, we quote the time it takes to accelerate from (e.g.) 0 - 100 kilometres per hour (kph). When we discuss the performance of a courier, we look at the time it takes to deliver an item. Likewise, a web site is judged on how long it takes to load a page.

We understand that the shorter the time the better the performance, and we are able to compare the performance of different cars, couriers, or different sites using that figure. Elapsed time as a measurement of performance is both useful and intuitive.

We can measure pretty much anything we like, although obviously some measurements of time are more useful than others: I don't rate a washing machine based on the time it takes to do a wash (although manufacturers often will), because (a) if it were to take 10 minutes it is unlikely that my clothes would be clean, and (b) for most purposes it doesn't really matter too much (unless it took more than half a day!) - I generally put the washing on, go away and do other things, and then come back to it when the machine beeps at me to tell me it's finished.

When we measure time, we do so between two events: when we quote the 0 - 100 kph performance of a car, we give the time between the instant that the car starts to move, and the instant that it is travelling at 100 kph. These events are definite moments in time. It is also important to use exactly the same measurement when comparing performance: it would be pointless to compare one car's 0 - 100 kph performance with another's 0 - 110 kph performance because that wouldn't be meaningful. It would also be misleading to quote the time "just before" the car reached 100 kph (much as I'm sure some manufacturers would love to do so) because again, it would be meaningless. These boundary events must be definite and repeatable.
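In software terms, the two boundary events are typically timestamps taken immediately before and after the work we care about. Below is a minimal sketch in Python using the standard library's time.perf_counter; do_some_work is just a hypothetical stand-in for whatever task is being measured.

    import time

    def do_some_work():
        # Hypothetical stand-in for the task being measured.
        return sum(i * i for i in range(1_000_000))

    start = time.perf_counter()   # definite event: the work begins
    do_some_work()
    end = time.perf_counter()     # definite event: the work completes

    print(f"Elapsed time: {end - start:.3f} seconds")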

Commonly, when talking about computer-related performance, we measure response time. This is a narrower application of elapsed time that measures the time between a specific trigger and a completed task. In many, but not all, cases the trigger will be something that a user has initiated - such as clicking on a hyperlink or button. So some examples of response times that we may be interested in are the times taken for:

  • A web page to load once a link on the current page has been clicked
  • An application to start on a PC once its icon has been double-clicked
  • A screen to refresh on a mobile after a button has been touched
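As an illustration of the first of these, the sketch below measures the time from the moment a request is issued (the trigger) to the moment the full response has been read (the completed task). It uses only Python's standard library, and the URL is a placeholder.

    import time
    import urllib.request

    url = "https://example.com/"            # placeholder page to fetch

    start = time.perf_counter()             # trigger: the request is issued
    with urllib.request.urlopen(url) as response:
        body = response.read()              # completed task: the page has fully arrived
    end = time.perf_counter()

    print(f"Response time: {end - start:.3f} seconds ({len(body)} bytes)")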

Throughput

Throughput is a slightly less intuitive measure of performance, but is nevertheless both common and important. Rather than measure a variable time interval, we fix the time, and count how many "things" we can get done in that time. "Things" can be anything we choose: transactions, deliveries, tasks, or whatever is meaningful to us.

Some everyday measurements of throughput are:

  • The number of letters a postal organisation processes per day
  • The number of trades a financial exchange handles per minute
  • The number of hits a web site receives (and processes) per day
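Measuring throughput in software follows the same pattern: fix a time window and count how many units of work complete within it. Here is a minimal sketch, where process_item is a hypothetical stand-in for whatever "thing" you are counting (a transaction, a request, a task).

    import time

    def process_item():
        # Hypothetical stand-in for one unit of work.
        sum(i * i for i in range(10_000))

    window_seconds = 5                      # the fixed time interval
    deadline = time.perf_counter() + window_seconds

    completed = 0
    while time.perf_counter() < deadline:
        process_item()
        completed += 1

    print(f"Throughput: {completed / window_seconds:.1f} items per second")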

It is interesting to note that we tend to talk about throughput in reference to more centralised activities - facilities, organisations, or industries as a whole, rather than end-users or consumers.

So, for example, a postal organisation might measure the number of letters that it delivers a day. In fact, for the company, that may be more important than the time it takes to deliver any one particular letter. Here we can see that different things are relevant depending on your perspective: as far as I'm concerned, I couldn't really care less how many letters the postal company delivers in a day - all I care about is how quickly my letter arrives. From the company's perspective, it needs to perform a delicate balancing act: the number of letters it can process in one day is a vital measure of its capacity, but although the exact delivery times are less important, it still needs to keep an eye on the average time it takes its customers to receive their mail. There's no point in having a massive capacity if it takes customers a week to receive their post.

Elapsed Time vs Throughput

It is reasonable to ask why we use two different measures for performance. Surely these two are the same thing? If you improve the response time (by reducing it) then you automatically improve the throughput (by increasing it), and vice versa, right?

Actually no.

Whilst they are often related, and for simple 'systems' they may be closely correlated, they are not simply different expressions of the same thing.

Let's first look at a simple example where they are related. Imagine a single clerk checking, say, application forms - to make sure they are filled in correctly. If that clerk takes 1 minute to check a form, the throughput is 1 form per minute. Now if, after they become more experienced at checking forms, it only takes 30 seconds to check a form, the throughput now becomes 2 forms per minute. In this case, the throughput is inversely proportional to the elapsed time.
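For a single worker handling one item at a time, this inverse relationship is easy to express directly:

    # One clerk, checking one form at a time:
    # throughput is the reciprocal of the per-form time.
    for seconds_per_form in (60, 30):
        forms_per_minute = 60 / seconds_per_form
        print(f"{seconds_per_form} s per form -> {forms_per_minute:.0f} forms per minute")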

In real systems the relationship is usually not so straightforward. For more complicated systems, such as the postal service, there is no simple correlation between delivery times and throughput. Indeed, we can lose the correlation for our simple form checking just by adding an extra 'resource': recruiting another clerk. Now the two clerks between them can check twice as many forms in the same time as one. The throughput has risen to 4 forms per minute, but the time taken to check a single form is unchanged at 30 seconds.
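The sketch below simulates both scenarios with a thread pool, using sleep as a stand-in for the checking work (scaled down from 30 seconds to 0.3 seconds so it runs quickly). With two 'clerks' the batch of forms finishes in roughly half the time, so throughput doubles, yet the time to check any single form is unchanged.

    import time
    from concurrent.futures import ThreadPoolExecutor

    CHECK_TIME = 0.3    # scaled-down stand-in for 30 seconds per form
    FORMS = 10

    def check_form(form_id):
        start = time.perf_counter()
        time.sleep(CHECK_TIME)              # the 'work' of checking one form
        return time.perf_counter() - start  # elapsed time for this form

    for clerks in (1, 2):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=clerks) as pool:
            per_form = list(pool.map(check_form, range(FORMS)))
        total = time.perf_counter() - start
        print(f"{clerks} clerk(s): {FORMS / total:.1f} forms per second, "
              f"{sum(per_form) / FORMS:.2f} s per form")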

It should also be noted that for more complicated systems, we often find that improving either response time or throughput can actually result in the other worsening: batching work together, for example, can raise throughput while making each individual request wait longer.

Which one is better / which one should I use?

There is no right or wrong answer to that question; it comes down to which one you care about most. In reality, you will almost certainly be performing the same balancing act as everyone else: maximising throughput without sacrificing elapsed time too much. What is important is that you keep an eye on both performance metrics, and make a conscious decision about which one you are prioritising.

In a future article, I'll discuss "the 3rd way": a slightly different type of performance metric, but one which is becoming increasingly important.
