A Theory-Oriented Real-Time Operating System for Temporally Sound Cyber-Physical Systems

Periodic Reporting for period 2 - TOROS (A Theory-Oriented Real-Time Operating System for Temporally Sound Cyber-Physical Systems)

Reporting period: 2020-07-01 to 2021-12-31

Virtually all modern technologies humans interact with every day — such as cars, airplanes, trains, satellites, or robots — are so-called cyber-physical systems. This means that embedded within them are computers and that these computers are essential to the system’s overall correct operation as they closely monitor, control, or generally interact with the physical environment. Naturally, many computers embedded in cyber-physical systems are inherently time-critical; that is, they must react to events in the physical world within stringent time bounds.

For example, consider the flight controller in a consumer quadcopter UAV (or “drone”) used for aerial videography. To maintain stable flight, the controller must react to any gust of wind within a few milliseconds, lest the drone be blown out of position or even crash. A video drone is a typical example of a time-critical cyber-physical system: any undue delay in its flight control software can lead to incorrect operation or even total system failure.

Such time-critical computing systems, which must satisfy time constraints in the “real” world, are also known as real-time systems. Just like regular computing systems, real-time systems typically comprise two or more software layers, with a real-time operating system (RTOS) as the “bottom” layer and real-time applications and various frameworks, runtime environments, or middleware forming the layer(s) above.

As the name suggests, an RTOS must manage and multiplex a system’s hardware resources so that multiple applications can run safely side-by-side, just like any other operating system. However, the crucial difference is that it must do so in a way such that all real-time applications “on top” can predictably meet their timing requirements. More precisely, an RTOS must offer facilities (i.e. APIs) such that developers of real-time applications can ensure that their applications execute in a predictable and timely manner.

The TOROS project seeks to radically rethink the design of modern RTOSs, what facilities they should (and should not) offer to application programmers, and ultimately how to make it easier to obtain temporally predictable real-time systems. In many ways, contemporary RTOSs differ little from those designed 30–50 years ago, as commercial RTOSs tend to evolve slowly, conservatively, and in a backward-compatible manner. Instead, the TOROS project asks: what would an RTOS for the next 50 years look like if we were to start over?

While a quick look out the window offers ample evidence that it is quite possible to build well-functioning cyber-physical systems with today’s RTOSs and analysis methods, what is not readily observable is the enormous cost involved in doing so. This cost manifests in multiple ways.

First and foremost, there is a large expertise barrier: current RTOSs expose low-level mechanisms that require too much expertise to be used in a temporally sound way. As a result, most deployed applications are practically unanalyzable, and where analyzability is mandated (e.g. in avionics), development costs are infamously high. Rather than blaming insufficient developer training, the TOROS project instead sees as the root cause a failure of abstraction in contemporary RTOS design: proper abstractions should elide the underlying mechanisms.

To address the issue, the TOROS project has developed an unorthodox process-less RTOS design centered around three orthogonal, high-level abstractions that correspond to (1) program state management, (2) short-lived computations with run-to-completion semantics (similar to event handlers), and (3) declarative management of processor time. Together, these three abstractions yield a minimal API that combines two essential properties: (i) idiomatic applications are guaranteed to be automatically analyzable (even if developers are entirely oblivious of the underlying details), and (ii) the API is sufficiently powerful and expressive to allow complex, real-world applications to be implemented (in a reactive, event-driven programming style).
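The following toy model illustrates how the three abstractions could fit together in an event-driven program. It is only a sketch under stated assumptions, not the TOROS API: the names `Reservation`, `Component`, and `ToyRuntime` are hypothetical, and the simple utilization check stands in for the actual analysis.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Reservation:
    """(3) Declarative processor time: `budget_ms` every `period_ms` (hypothetical)."""
    budget_ms: float
    period_ms: float

    def utilization(self) -> float:
        return self.budget_ms / self.period_ms


@dataclass
class Component:
    """Bundles (1) managed state, (2) event handlers, and (3) a reservation."""
    name: str
    reservation: Reservation
    state: Dict[str, float] = field(default_factory=dict)
    handlers: Dict[str, Callable] = field(default_factory=dict)

    def on(self, event: str, handler: Callable) -> None:
        self.handlers[event] = handler


class ToyRuntime:
    """Dispatches events one at a time; each handler runs to completion."""

    def __init__(self) -> None:
        self.components = []
        self.queue = deque()

    def register(self, component: Component) -> None:
        # Placeholder admission test: reject if reservations exceed one processor.
        total = sum(c.reservation.utilization() for c in self.components)
        if total + component.reservation.utilization() > 1.0:
            raise ValueError("reservations exceed processor capacity")
        self.components.append(component)

    def post(self, component: Component, event: str, payload: float) -> None:
        self.queue.append((component, event, payload))

    def run(self) -> int:
        handled = 0
        while self.queue:
            component, event, payload = self.queue.popleft()
            # The handler sees only its component's state and never blocks.
            component.handlers[event](component.state, payload)
            handled += 1
        return handled


def on_gyro(state: Dict[str, float], reading: float) -> None:
    # Short-lived computation: a simple low-pass filter over gyro readings.
    state["tilt"] = 0.9 * state.get("tilt", 0.0) + 0.1 * reading


ctrl = Component("attitude", Reservation(budget_ms=1.0, period_ms=4.0))
ctrl.on("gyro", on_gyro)
runtime = ToyRuntime()
runtime.register(ctrl)
runtime.post(ctrl, "gyro", 10.0)
runtime.post(ctrl, "gyro", 10.0)
handled = runtime.run()
```

In this sketch the application never creates threads or manipulates priorities; it only declares state, handlers, and a processor-time reservation, which is the property that would let the system, rather than the developer, carry the burden of analysis.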

A second major issue is that most applicable analyses rely on idealized worst-case execution-time assumptions that cannot be satisfied in practice. This cost manifests as prohibitive over-provisioning in temporally sound systems.

In response, the TOROS project is developing below-worst-case provisioning methods that replace commonplace “guesstimation techniques” with analysis-driven methods for determining acceptable safety margins at reasonable costs. Most recently, the project has pioneered the use of Monte Carlo methods to this end.
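To give a feel for the idea, the sketch below uses Monte Carlo simulation to choose a processor-time budget below the worst case. It is an illustration only, not the project's method: the execution-time model, the function names, and the assumption that job execution times are independent are all hypothetical.

```python
import random

FAST_MIN_MS, FAST_MAX_MS = 0.8, 1.2   # assumed typical execution-time range
SLOW_EXTRA_MS = 3.0                   # assumed penalty of a rare slow path
SLOW_PROBABILITY = 0.05               # assumed likelihood of the slow path


def sample_execution_time_ms(rng: random.Random) -> float:
    """Draw one execution time from the hypothetical model above."""
    base = rng.uniform(FAST_MIN_MS, FAST_MAX_MS)
    return base + (SLOW_EXTRA_MS if rng.random() < SLOW_PROBABILITY else 0.0)


def overrun_probability(budget_ms, jobs_per_window, trials, rng):
    """Estimate how often the total demand of one window exceeds the budget."""
    overruns = 0
    for _ in range(trials):
        demand = sum(sample_execution_time_ms(rng) for _ in range(jobs_per_window))
        if demand > budget_ms:
            overruns += 1
    return overruns / trials


def smallest_budget(target, jobs_per_window, trials, seed=1, step_ms=0.5):
    """Search upward for the first budget whose estimated overrun rate meets the target."""
    worst_case_ms = jobs_per_window * (FAST_MAX_MS + SLOW_EXTRA_MS)
    budget_ms = 0.0
    while budget_ms < worst_case_ms:
        budget_ms += step_ms
        # Re-seed so that every candidate budget is judged on the same samples.
        rng = random.Random(seed)
        if overrun_probability(budget_ms, jobs_per_window, trials, rng) <= target:
            return budget_ms
    return worst_case_ms


budget = smallest_budget(target=0.01, jobs_per_window=10, trials=2000)
```

Under this model the worst case for ten jobs is 42 ms, while the returned budget is whatever smaller value keeps the estimated overrun rate at or below one percent. The gap between the two is the over-provisioning that an analysis-driven safety margin aims to avoid.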

Finally, a third aspect that makes the cost/benefit ratio of sound methods unattractive is that available real-time theory depends on complex and tedious proofs, some of which have turned out to be flawed. To improve in this regard, the project is working towards verifying the analysis underlying the TOROS system with the Coq proof assistant.

Going forward, work will continue along the three research vectors sketched above, with the goal of obtaining a mostly complete demonstrator OS that combines the proposed design and all individual advances in one consistent, working system. To this end, further advancements are needed in all aspects.

In the design of TOROS, a key element that still represents an unsolved problem is the communication protocol for the synchronous invocation of computations (i.e. with call-return semantics). In particular, a protocol with strong isolation properties is required to retain the automatic analyzability of idiomatic applications. Specifically, invocations of “unrelated” functionality or services must not interfere with one another, which is obviously desirable but not at all trivial to achieve for technical reasons that go beyond the scope of this summary. The project hopes to obtain the first protocol of this kind and prove its properties in the coming year.

Concerning below-worst-case provisioning, several major challenges remain. On the system side, the project plans to introduce a fourth foundational abstraction into the TOROS design, namely dedicated “slack pools.” The idea is to make the provisioning of safety margins an explicit OS feature with first-class support. Once established, a slack pool can represent extra processing capacity held back in reserve exclusively for certain computations that may “tap into” the pool if exceptional demand is encountered. The slack pool concept poses several non-obvious analysis, policy design, and implementation issues, especially when combined with slack reclamation techniques. To date, no existing RTOS possesses similar capabilities.
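The slack pool idea can be pictured with the following minimal sketch. It is an assumption-laden illustration, not the TOROS design: the class `SlackPool`, the helper `run_job`, and the replenishment policy are hypothetical, and slack reclamation is left out entirely.

```python
class SlackPool:
    """Reserve capacity that only designated computations may draw on (hypothetical)."""

    def __init__(self, capacity_ms, members):
        self.capacity_ms = capacity_ms
        self.remaining_ms = capacity_ms
        self.members = frozenset(members)

    def draw(self, task, amount_ms):
        """Grant up to `amount_ms` to a member task and return the amount granted."""
        if task not in self.members:
            return 0.0
        granted = min(amount_ms, self.remaining_ms)
        self.remaining_ms -= granted
        return granted

    def replenish(self):
        """Refill the pool, e.g. at the start of each provisioning period (assumed policy)."""
        self.remaining_ms = self.capacity_ms


def run_job(task, budget_ms, demand_ms, pool):
    """Run one job; return (completed, slack_used) given its nominal budget."""
    shortfall = max(0.0, demand_ms - budget_ms)
    slack_used = pool.draw(task, shortfall) if shortfall > 0 else 0.0
    return slack_used >= shortfall, slack_used


pool = SlackPool(capacity_ms=2.0, members={"control"})
first = run_job("control", budget_ms=1.0, demand_ms=2.5, pool=pool)   # taps the pool
second = run_job("logging", budget_ms=1.0, demand_ms=1.5, pool=pool)  # not a member
third = run_job("control", budget_ms=1.0, demand_ms=2.0, pool=pool)   # pool runs dry
```

Here the first job overruns its budget by 1.5 ms and completes by drawing on the pool, the non-member job gets nothing, and the third job exhausts the remaining 0.5 ms without finishing. Deciding how large the pool must be and who may draw from it is exactly the kind of analysis and policy question the text refers to.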

On the analysis side, one particularly tricky issue in the context of below-worst-case provisioning is that of correlated execution times. While an exact solution appears infeasible to obtain, the project hopes to find a good tradeoff with acceptable pessimism and feasible analysis runtime.

Last but not least, there is much room for further verification advances. Currently, the project is developing the first foundational tool for verified response-time analysis, where “foundational” in this context means that the tool itself needs to be neither verified nor trusted. Instead, the tool generates certificates of correctness for its output in the logic of the underlying proof assistant (Coq), which can then be independently verified.
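The certificate-checking principle can be illustrated by analogy in a few lines of Python. This is only a sketch: the project's certificates are expressed in Coq's logic, whereas the example below assumes classic fixed-priority preemptive scheduling of independent periodic tasks with deadlines equal to periods, and all function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def solve_response_time(index, tasks):
    """Untrusted solver: fixed-point iteration over (cost, period) pairs.

    `tasks` is sorted by decreasing priority; returns a bound or None.
    """
    cost, period = tasks[index]
    higher = tasks[:index]
    r = cost
    while r <= period:
        nxt = cost + sum(ceil_div(r, t) * c for c, t in higher)
        if nxt == r:
            return r
        r = nxt
    return None


def check_certificate(index, tasks, claimed_bound):
    """Small independent checker: accepts a bound only if it covers all demand.

    The checker does not care how the bound was found; it only re-evaluates
    the demand inequality at the claimed value.
    """
    cost, period = tasks[index]
    if claimed_bound is None or not (0 < claimed_bound <= period):
        return False
    demand = cost + sum(ceil_div(claimed_bound, t) * c for c, t in tasks[:index])
    return demand <= claimed_bound


tasks = [(1, 4), (2, 6), (3, 12)]           # (cost, period), highest priority first
bound = solve_response_time(2, tasks)        # produced by the untrusted solver
accepted = check_certificate(2, tasks, bound)
rejected = check_certificate(2, tasks, bound - 1)
```

The point of the design is that only the short checker needs to be trusted: a buggy solver that reports a bound that is too small is caught, because the claimed value fails the demand inequality.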