Users are lost in the noise
Windows, Android, iOS … all of today’s operating systems suffer from a malaise that runs deep, manifesting itself as a device which doesn’t respond when you interact with it. I don’t believe it is acceptable for a laptop, iPad or smartphone to simply decide that it has more important things to do than pay attention to its user. Here are some thoughts on why this happens (which goes to the root of how we develop devices) and what we could do about it.
The design pattern followed since the time-sharing systems of the 1960s is that the user interface is just one of many tasks juggled by the operating system. This “separation of concerns” allows software engineers to focus on writing software and UI designers to focus on designing the experience, but the split has now led to an unacceptable user experience which neither the OS nor the UI can fix on its own. The user interface may be assigned a higher OS “priority” to try to make it more responsive, but to the operating system you, the user, are just like any other task. In my opinion that is just plain wrong, and it’s time to recognise it and fix it. Virus-scanning, picking up email, file syncing, software updates… the list of tasks that a modern laptop or smartphone OS is expected to juggle is long and getting longer, and the user has disappeared into the morass. It’s time for a fundamental rethink of how software engineers deal with user experience.
We all recognise the situation: we go to interact with our device by pressing a key, waggling the mouse or touching the screen, but the device takes seconds or more to get around to responding. If we’d just asked it to do some heavy work we could forgive it, yet increasingly this happens when the device isn’t heavily loaded at all: CPU usage low, disk utilisation low, network utilisation low. The device just isn’t responding, yet is apparently sitting there lost in introspection, not doing anything else useful either.
Why it happens
So how does this happen? One technical reason is the classic “priority inversion”, a bane of multi-tasking: a low-priority task (e.g. updating your PDF reader to the latest version) grabs a resource, then gets shunted out of the way because a high-priority task (such as the UI) now needs to run. But the UI needs that same resource, so it has no option but to wait until the low-priority task has finished with it. If the low-priority task is itself waiting for something to happen (for example, for a network query to a remote machine to return) then the whole device stops responding even though it is doing nothing else. This highlights the fact that “priorities” are a very bad way to allocate resources, because priority depends on context, and context changes.
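The inversion is easy to reproduce in miniature. Here is a toy Python sketch (the lock, the task names and the sleep are all stand-ins, not how a real scheduler works): a background task grabs a shared resource and then waits on something slow, and the “high-priority” UI can do nothing but queue up behind it.

```python
import threading
import time

resource = threading.Lock()
holding = threading.Event()   # set once the background task owns the resource
events = []

def background_update():
    # Low-priority task: grabs the shared resource, then blocks on a
    # slow external operation (a sleep stands in for a network wait).
    with resource:
        events.append("background holds resource")
        holding.set()
        time.sleep(0.1)
        events.append("background done")

def ui_task():
    holding.wait()                 # the user interacts while the resource is held
    events.append("ui wants resource")
    with resource:                 # the "high-priority" UI can only wait
        events.append("ui responding")

threads = [threading.Thread(target=background_update),
           threading.Thread(target=ui_task)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(events)
# -> ['background holds resource', 'ui wants resource', 'background done', 'ui responding']
```

However urgent the UI is, it responds only after the background task has finished with the resource, which is exactly the failure mode described above.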
In the 1990s I worked at Chromatic Research on the world’s first commercial “media-processor” – a device which delivered all the media functions of a PC (audio, modem, 2D, 3D and video) in software. All these media tasks are obviously time-critical: the next frame of your video really does need to be decoded 1/30 sec after the previous one to avoid video stutter, and audio is even more critical. So our kernel (the low-level heart of our operating system) used so-called “Earliest Deadline First” scheduling – each task had a deadline, and the task with the most urgent deadline was the one chosen to run. While useful for media processing, this isn’t a panacea for everything, but it illustrates that schemes entirely different from “priority” are possible for helping an OS decide what is most urgent.
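The core of EDF fits in a few lines. This is a minimal sketch (the task names and millisecond deadlines are invented for illustration), using a heap so the most urgent deadline is always picked next:

```python
import heapq

def edf_order(tasks):
    """Return task names in Earliest-Deadline-First order.

    tasks: iterable of (deadline, name) pairs; the task with the
    smallest (most urgent) deadline is always chosen to run next."""
    heap = list(tasks)
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Hypothetical media deadlines, in milliseconds from now
tasks = [(33, "video frame"), (8, "audio buffer"), (100, "2D blit")]
print(edf_order(tasks))   # -> ['audio buffer', 'video frame', '2D blit']
```

Note that no task carries a fixed “priority”: urgency falls out of the deadlines themselves, and changes as deadlines approach.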
I propose an approach which recognises that user input is different from all other tasks. The user OWNS the device: they have paid for it, it exists for their pleasure, and their wishes are surely more important than the wishes or convenience of anyone else – including the hundreds of software vendors whose software is running on that device.
The idea I’d like to propose borrows from a concept called “tainting”, found in programming languages such as Perl and Ruby, which works as follows: any input a program takes from the outside world is marked as “tainted”, and any data subsequently derived from that data is automatically tainted by association. This makes it possible, deep in the heart of the software, to keep track of which data is internal (and therefore trusted) and which came from outside (and is therefore exposed to hackers and should not be trusted). So, for example, when the software constructs an SQL query, it can simply check before running it that the query contains no tainted data. This ensures that nothing a user types into a web form (e.g. a hacker typing malformed entries) can end up being run as code. John von Neumann’s clever realisation that code and data can exist in the same address space is the root of many security problems, and tainting is one tool to help address this. Anyway, in this proposal we take the principle of tainting and turn it on its head…
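Before inverting the idea, here is a toy sketch of tainting itself in Python (Python has no built-in taint mode, so the `Tainted` class and helper names here are mine): taint enters with outside data, spreads through derived data, and is checked at the dangerous boundary.

```python
class Tainted(str):
    """A string that arrived from the outside world."""

def combine(a, b):
    # Data derived from tainted data is itself tainted.
    joined = a + b          # str concatenation returns a plain str
    if isinstance(a, Tainted) or isinstance(b, Tainted):
        return Tainted(joined)
    return joined

def run_query(sql):
    # Deep in the database layer we can refuse untrusted input.
    if isinstance(sql, Tainted):
        raise ValueError("query contains tainted data")
    return "executed"

form_field = Tainted("x' OR '1'='1")        # what a hacker typed into a web form
query = combine("SELECT * FROM users WHERE name='",
                combine(form_field, "'"))
```

Here `run_query(query)` raises, because the taint from `form_field` propagated into the whole query string, while a purely internal query such as `run_query("SELECT 1")` runs normally.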
Everything which comes directly from the user (via the user interface) is automatically “blessed” – all mouse data, touch data and keystrokes. Blessedness, like tainting, spreads: any task using blessed data itself becomes blessed, and any other task interacting with that task becomes blessed. Over time, blessedness fades, so if I ask my device to download a 1 TB file, the blessedness gradually fades during the download. This captures the principle that whatever the user did most recently is most urgent, beating even things they asked to get done earlier.
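One way to sketch this (the class, the exponential half-life decay and all names are my own assumptions, not a worked-out design) is to timestamp each blessing, let blessedness spread to tasks that interact with a blessed task, and let it fade with age:

```python
import time

class Task:
    def __init__(self, name):
        self.name = name
        self.blessed_at = None          # when the user last "touched" this task

    def bless(self, now=None):
        self.blessed_at = now if now is not None else time.monotonic()

    def blessedness(self, half_life=30.0, now=None):
        """1.0 immediately after user input, fading towards 0 over time."""
        if self.blessed_at is None:
            return 0.0
        now = now if now is not None else time.monotonic()
        return 0.5 ** ((now - self.blessed_at) / half_life)

def interact(src, dst):
    # Blessedness spreads: a task touched by a blessed task is blessed too.
    if src.blessed_at is not None and (dst.blessed_at is None
                                       or src.blessed_at > dst.blessed_at):
        dst.blessed_at = src.blessed_at
```

So if the UI (blessed at time 0) hands work to a download task, the download inherits the blessing, and with a 30-second half-life its blessedness has dropped to 0.5 after 30 seconds – at which point a fresh keystroke elsewhere outranks it.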
Acting on Blessedness
So, blessing keeps track of what’s important. But how do we ensure that a device can always act on that knowledge to stay responsive? We need to do some minor heart surgery on our OS.
When Android first appeared, I was delighted with its “take no prisoners” approach to task management. If a task ever goes off for too long without occasionally returning control to the event loop (which renders the task, and potentially the whole device, unresponsive), then Android simply kills that task off with extreme prejudice.
This is a good approach not only because it saves you in that particular instance, but also because software engineers learn that if they write lazy code which sometimes doesn’t return, their code gets killed, so it exerts what evolutionary biologists might call “selection pressure” on code to be better-behaved. Be a good citizen and the world will treat you well.
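The event-loop check-in described above amounts to a watchdog. Here is a minimal sketch (the class and method names are mine, not Android’s actual ANR machinery): a task heartbeats each time it returns to the event loop, and staying away too long marks it for the kill.

```python
import time

class Watchdog:
    """Toy ANR-style check: a task must check in each time it
    returns to the event loop; stay away too long and it is
    declared unresponsive."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = time.monotonic()

    def heartbeat(self):
        # Called whenever the task yields back to the event loop.
        self.last_seen = time.monotonic()

    def unresponsive(self):
        # A real OS would now kill the task with extreme prejudice.
        return time.monotonic() - self.last_seen > self.timeout

wd = Watchdog(timeout=0.05)
time.sleep(0.1)                 # lazy code that never returns to the loop
print(wd.unresponsive())        # -> True
```

The selection pressure comes from the kill being unconditional: code that heartbeats regularly survives; code that doesn’t, dies.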
Unfortunately, as Android has matured this principle has been diluted – now, for example, if you tell Android to kill a task, it throws up all sorts of warnings that you may “lose data” and so on. Bad! This is caving in to the software-engineer-first approach, which allows engineers to live in a dreamworld where their little task is the centre of the world and will persist forever. That principle stems from 1960s multitasking and is IMHO the root of the problem, because if you accept it then you also accept that when the user’s needs conflict with a task’s requirement to persist forever (for example because of priority inversion), the user loses. Which is surely the wrong answer.
Angel of Death
I would like to propose going in exactly the opposite direction. If a blessed task is being prevented from running, then the operating system (or perhaps some “Angel of Death” process sitting alongside it!) steps in to take whatever action is necessary to allow that blessed task to run. If the problem is priority inversion (i.e. some other task currently owns a resource that our blessed task now needs), then that other task is summarily terminated and the resource reassigned to the blessed task by brute force. This “take no prisoners” approach to resource ownership, task existence and probably many other aspects of system design would in turn lead to a world where software engineers recognise that they must write their code accordingly. Those that don’t (e.g. those that hold resources for long periods) will find their code doesn’t work very well because it keeps getting killed off. Users will stop buying those apps, and evolution will drive behaviour that is more user-centric.
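The eviction rule can be sketched in a few lines (everything here – the classes, the `angel_of_death` function, the task names – is hypothetical illustration, not a real kernel mechanism): when a blessed task is blocked on a resource, the current owner is terminated and the resource handed over by force.

```python
class Task:
    def __init__(self, name, blessed=False):
        self.name = name
        self.blessed = blessed
        self.alive = True

class Resource:
    def __init__(self):
        self.owner = None

def angel_of_death(resource, waiter):
    """If a blessed task is blocked on a resource, evict the
    current owner by force and hand the resource over."""
    owner = resource.owner
    if waiter.blessed and owner is not None and owner is not waiter:
        owner.alive = False         # summary termination, no appeal
        resource.owner = waiter
    return resource.owner

updater = Task("pdf-updater")
ui = Task("ui", blessed=True)
lock = Resource()
lock.owner = updater                # the priority inversion in the making
angel_of_death(lock, ui)
print(lock.owner.name, updater.alive)   # -> ui False
```

The hard part, of course, is everything this sketch leaves out: the killed task may have left the resource in an inconsistent state, which is precisely why engineers would have to write crash-tolerant, short-holding code to survive in such a world.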
If you’re a software engineer, I expect you’re thinking “yes, but…” at this point, and I too can see many challenges with this approach. But I think it’s worth exploring, not least because it seems well-aligned with another shift happening at the moment: asynchronous programming, as in Node.js, whereby yesterday’s monolithic code is reduced to small snippets which run quickly and never wait on external resources. Another related paradigm is the MirageOS unikernel (originally called OpenMirage), which is structured to make it natural to avoid mutexes.