Monday, October 15, 2018

Programmer vs Software Engineer: The Interview

A common question presented in interviews for developer positions goes something like this:
Given an array of numbers, write a function that returns the three largest values. Is your approach the optimal approach?
A programmer and a software engineer will give you two different answers to this question. Let's look at each hypothetical answer, then discuss a bit which is better, and why.

The Programmer Answer


The approach begins by checking whether the array has at least 3 values,  if it doesn't, return a copy of the array, which has been sorted. Otherwise, grab the first three items from the array, sort them into descending order, and then iterate through the rest of the input, looking for values which are larger than the three items you've already obtained. In code (JavaScript today) it looks something like this:

function findTopThree(arr) {
    // Handle trivial case
    if (arr.length <= 3)
    {
        return arr.slice(0,3).sort();
    }
    
    let result = [arr[0], arr[1], arr[2]];
    result.sort((a, b) => b - a);
    
    for (let i = 3; i < arr.length; i++)
    {
        if (arr[i] > result[2])
        {
            if (arr[i] > result[1])
            {
                if (arr[i] > result[0])
                {
                    result.unshift(arr[i]);
                    result.pop();
                }
                else
                {
                    result.splice(1, 0, arr[i]);
                    result.pop();
                }
            }
            else
            {
                result[2] = arr[i];
            }
        }
    }
    return result;
}
For the worst case, an array sorted into ascending order, this solution runs in 3n comparisons. (Plus a constant for sorting the first three items.)  There are several other ways to do this, like Math.max, but this method will tend to run less comparisons for randomly ordered data than repeatedly finding the maximum. In the best case, an array in descending order, it runs n comparisons.

But is this the optimal solution? The short answer is maybe, but maybe not. To understand why, let's look at the software engineer's answer.

The Software Engineer's Answer

The 'best' solution, is not necessarily the answer given above. To determine how to implement this algorithm we need to consider some additional details that the question doesn't consider.

1. Will we be doing this a lot? Do we occasionally want the top 4? If we sort this data, and keep it sorted, finding the top n values in the array is a constant-time algorithm. If we are going to be doing this a lot, it may make sense to sort the data.
2. Later on in the code, do we need the median value? If so, we'll have to sort the data anyway to find the midpoint, so we can combine the two operations, sort the array once, and find all the values we care about at the same time.
3. What is the source of the values in this array? If it's a list of anonymized test scores, we can sort the values without a problem. If the data is an image, we can't sort it without destroying the data's spacial correlation. We can't decide the best algorithm without knowing what's in the array. 
4. Is the data even something that can be sorted? Let's say our array is a set of process objects representing a group of things running on our server, and the top three values we're interested in are the memory use of the three processes that are using the most memory. In this case, it doesn't make sense to try to sort the data and keep it sorted, since the values can change at any time. We'd have to re-sort the array any time we wanted to deal with it.
5. Is the array small? If it's small then the trivial solution that makes a copy, sorts it, and returns the top three items may be best. It's certainly the easiest logic to understand and maintain in the future.
6. Does this code run in the main time-critical code path, or is it diagnostic code that runs after our process has suffered a fatal exception? If it's time-critical we need to worry about making it fast. If it's not, we can (and should) write it to be easy to understand and maintain, which may come at the expense of its execution speed. If the process has already crashed, it probably doesn't matter if we take an extra 500 ms to process some information.

These are just the first 6 things I could come up with when thinking about the question. There are certainly more things that would change how the algorithm is implemented.

The Takeaway

The lesson from all of this is:
Before deciding if a solution is optimal, make sure you understand the problem and the environment that your code will be running in. An O(n) algorithm is not inherently better than an O(n2) algorithm.
As a bonus, when you better understand how your code fits into the project, you can better reason about possible edge-cases, but that's a topic for another day.

No comments:

Post a Comment

All comments are moderated. I have a life, so it may take some time to show up. I reserve the right to delete any comment for any reason or no reason at all. Be nice. Racist, homophobic, transphobic, misogynist, or rude comments will get you banned.

Programmer vs Software Engineer: The Interview

A common question presented in interviews for developer positions goes something like this: Given an array of numbers, write a function th...