CORDIS - EU research results

Research on Really Reliable and Secure Systems Software

Final Report Summary - R3S3 (Research on Really Reliable and Secure Systems Software)

Current operating systems have poor reliability and security. Computers crash regularly, whereas other electronic devices such as televisions never crash. Furthermore, practically every week one reads in the newspaper about another security hole in Windows. Attacks by viruses and worms are rampant. It is no exaggeration to say that Windows has very serious reliability and security issues. UNIX derivatives such as Linux are not in the press as much. In part this is because they are slightly more secure, but mainly it is because hackers tend to focus on hitting Windows. As computers become more and more essential for all aspects of society, many people regard this situation as unacceptable.

The goal of my research was to investigate, design, implement, and test an operating system that is more reliable and secure than current ones. Most reliability and security problems ultimately come down to the fact that programmers are human and as such are not perfect. While no one disputes this statement, no current systems are designed to deal with its consequences. Studies have shown that the number of bugs in commercially available software ranges from about 1 to 10 bugs per 1000 lines of code. Large and complex software systems, such as operating systems, have more bugs per 1000 lines of code because of the larger number of modules and their more complex interactions.

In this project we have taken an operating system that we developed as an educational tool and turned it into a system that is more dependable than existing systems. It is based on a tiny microkernel of about 9000 lines of code (vs. tens of millions for Windows or Linux). Most of the operating system runs as a set of servers and device drivers, each protected by the hardware from interference by other parts of the system. These components communicate with each other using very well-defined protocols. The increased dependability and security come from this modular design.
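
The defect densities quoted above give a feel for why the size of the kernel matters. The short calculation below is only an illustrative sketch: the 9000-line figure and the range of 1 to 10 bugs per 1000 lines come from the text, while the 10-million-line figure is an assumed, conservative stand-in for "tens of millions", and the function names are made up for the example.

```python
# Rough bug-count estimate from the defect densities quoted in the text.

MICROKERNEL_LOC = 9_000        # from the text
MONOLITHIC_LOC = 10_000_000    # assumption: conservative stand-in for "tens of millions"
LOW_DENSITY = 1                # bugs per 1000 lines, from the text
HIGH_DENSITY = 10              # bugs per 1000 lines, from the text


def estimated_bugs(lines_of_code, bugs_per_kloc):
    """Expected number of bugs for a code base of the given size."""
    return lines_of_code / 1000 * bugs_per_kloc


def bug_range(lines_of_code):
    """Return (low, high) bug estimates for the quoted density range."""
    return (estimated_bugs(lines_of_code, LOW_DENSITY),
            estimated_bugs(lines_of_code, HIGH_DENSITY))


if __name__ == "__main__":
    print("microkernel:", bug_range(MICROKERNEL_LOC))
    print("monolithic :", bug_range(MONOLITHIC_LOC))
```

Under these assumptions the microkernel itself would contain on the order of 9 to 90 bugs, against at least 10,000 to 100,000 for a 10-million-line system. Since the text notes that large systems tend to have higher defect densities, the second range is, if anything, optimistic.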
Each component of the system has limited access to the others, which means that they have largely independent failure modes. If one component fails, in many cases it can be replaced on the fly while the system is running and without disturbing running programs. Few, if any, other operating systems are self-healing like this.

Another (related) aspect of the research is that we have discovered how to update the operating system from one version to another while it is running. Software updates are extremely common for operating systems and much other software. Some updates are needed to patch security holes; others add new features. With all commercially available operating systems, installing a new version means shutting down the computer and then rebooting it with the new version, which leads to a loss of service in the meantime. For computer systems that control, say, a nuclear reactor, shutting down the control computer for an update, even a crucial one, is highly undesirable. While it is technically possible to shut down banking and e-commerce servers, their owners do not like doing so because customers expect them to run 24/7. We have shown that with our design, this shutdown is not required.

The ability to do these live updates also has important security properties. We can intentionally replace the operating system every few seconds, making it much harder for an intruder to attack it, since the intruder knows nothing about the structure of the system as it is currently running.

Another important area in which we have had good results is testing for dependability. If a new design is claimed to be faster than an existing one, it is straightforward to test that claim. However, a claim of better dependability is very hard to evaluate. Until our research, people just injected some faults and watched what happened. They rarely understood what was going on.
Our research has greatly improved this situation by developing a framework for injecting faults and carefully observing what happens next.

Modern CPU chips have multiple cores, but current operating systems use them in a very simple way. We have shown that a better approach is to split the operating system into multiple components, as described above, and run the components on different cores. We have experimented extensively with this approach for the network stack in a system of heterogeneous cores and made some surprising discoveries. For example, slower cores sometimes outperform faster cores because of fewer context switches. As more chips begin to contain heterogeneous cores, this insight could prove to be important.

The research also looked at making the file system more robust. To do this, we investigated changing the storage stack from one that is block oriented (the traditional way) to one that is file oriented. Our design has a number of advantages over existing ones, such as the ability to replicate important files while not replicating temporary files that can easily be recreated in the event of a failure.
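
The selective replication mentioned above can be pictured with a toy model. The sketch below is not the project's storage stack; the class name, the `temporary` flag, and the use of in-memory dictionaries as "disks" are all assumptions made for illustration. It only shows why a file-oriented layer can make a per-file decision that a block-oriented layer, which sees only anonymous blocks, cannot.

```python
from dataclasses import dataclass, field


@dataclass
class FileOrientedStore:
    """Toy in-memory store that decides replication per file (hypothetical)."""
    replicas: int = 2                                   # assumed copy count for important files
    disks: list = field(init=False, repr=False)

    def __post_init__(self):
        # One dict per simulated disk, mapping file name -> contents.
        self.disks = [dict() for _ in range(self.replicas)]

    def write(self, name, data, temporary=False):
        """Store a file and return the number of copies written."""
        # Temporary files get a single copy; everything else is replicated.
        copies = 1 if temporary else self.replicas
        for disk in self.disks[:copies]:
            disk[name] = data
        return copies

    def read(self, name, failed_disks=()):
        """Read from the first surviving disk that holds the file."""
        for index, disk in enumerate(self.disks):
            if index not in failed_disks and name in disk:
                return disk[name]
        raise FileNotFoundError(name)


if __name__ == "__main__":
    store = FileOrientedStore()
    store.write("ledger.db", b"balances")
    store.write("scratch.tmp", b"cache", temporary=True)
    # The important file survives the loss of disk 0; the temporary one does not.
    print(store.read("ledger.db", failed_disks={0}))
```

In this model the cost of replication is paid only for files worth protecting, while a lost temporary file is simply recreated by whoever needs it.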