Periodic Reporting for period 4 - TOROS (A Theory-Oriented Real-Time Operating System for Temporally Sound Cyber-Physical Systems)
Reporting period: 2023-07-01 to 2023-12-31
As the name suggests, a cyber-physical system (CPS) is a complex system comprising both computing (i.e. “cyber”) elements and physical processes that interact and influence each other. Cars, airplanes, and trains, for instance, are all CPSs encountered routinely in everyday life. More generally, most modern technologies, including transportation, manufacturing, medical devices, and automation, are inherently CPSs: they could not work without embedded computers that closely monitor, react to, control, or generally interact with the physical environment, and the correct functioning of these embedded computers is critical to the proper operation of the whole system.
One aspect that makes such a CPS particularly challenging to engineer — and hence interesting to study — is that the computing systems embedded within them are often inherently time-critical. That is, the embedded computing systems must react to inputs representing events in the physical world within stringent time windows. For example, consider the flight controller in a consumer UAV (or “drone”) used for aerial videography. To maintain stable flight, the controller must react to any wind gusts within a few milliseconds. The timing here is essential: a delayed reaction by the onboard computer could result in the UAV being blown out of position, or in the worst case, even result in instability and ultimately crash the UAV. Such timing constraints are the rule rather than the exception and can be found in most cyber-physical applications: the physical world never waits.
Consequently, each computing element in a CPS typically runs a real-time operating system (RTOS), which is a special kind of operating system designed specifically for time-critical applications. In particular, an RTOS must manage and multiplex an embedded system’s hardware resources and devices so that all tasks running “on top” of the RTOS can predictably meet their timing requirements. More precisely, an RTOS must offer facilities (i.e. APIs) such that developers of real-time applications can ensure that their applications execute in a predictable and timely manner. The design and implementation of these APIs and the underlying policies chosen to arbitrate access to shared resources (such as processor time) greatly affect whether timing guarantees can be given and how strong those guarantees are.
The societal relevance of the CPS domain, and by extension, the societal relevance of the RTOS deployed at the core of each CPS, derives from the ubiquity of CPSs — they are pervasive in modern society — and from the fact that it is not unusual for CPSs to be deployed in safety-critical contexts (e.g. airplanes, cars, autonomous systems). Safety-critical uses typically come with safety certification requirements (i.e. products must be shown to be safe before commercial availability). For time-critical components, safety certification includes (among many other requirements) a need to demonstrate, ideally via static analysis, that all temporal requirements will be met at runtime.
Unfortunately, this is where there still exists a large gap between practice and state-of-the-art research: the design of currently employed RTOSs, hardware characteristics of commonly used off-the-shelf computing platforms, limitations in the available research literature, and complexities inherent in modern CPSs all conspire to make it highly challenging, and often economically unviable, to rigorously demonstrate that all temporal requirements can be met at runtime. Additionally, even in those cases where current techniques work, the results are insufficiently explainable and trustworthy due to the high complexity of the underlying theory.
The TOROS project was motivated by this gap and, at a high level, investigated three main questions:
1. How can we change RTOS design and APIs to enable better temporal analysis or to make temporal analysis feasible in the first place?
2. How can we analytically better deal with the temporal uncertainty inherent in contemporary hardware and software stacks?
3. How can we make applicable analysis more trustworthy and explainable?
The project's main results include the following:
1. The TOROS team developed a synchronous inter-process communication (IPC) protocol named G(IP)^2C. Notably, it is the first IPC protocol offering strong temporal isolation and support for server-to-server invocations. In this context, strong temporal isolation means that unrelated invocations must not delay one another. The second distinguishing feature, support for server-to-server invocations, means that an ensemble of cooperating servers can fulfill a client’s request rather than just a single server process. Analysis-friendly support for server-to-server invocations is challenging to achieve, particularly without invalidating the aforementioned temporal isolation guarantee. The project team’s work on the G(IP)^2C protocol was recognized with the RTAS’23 Best Paper Award.
2. Concerning analysis trustworthiness, the project team presented the first formal verification of the busy-window principle, a central method in the real-time systems literature. From the verified principle, the project team obtained verified response-time analyses for 8 common schedulers and workloads. Of these, 3 were entirely novel, and all but 1 had not previously been verified. The project team’s verification effort won an ECRTS’20 Outstanding Paper Award.
3. Building on (2), the project team developed POET, the first foundational tool for verified response-time analysis. “Foundational” in this context means the tool itself, POET, needs to be neither verified nor trusted. Instead, POET generates certificates of correctness for its output in the logic of the underlying proof assistant (Coq). Notably, POET’s certificates represent explainable evidence in the sense that an independent expert can inspect the claimed bounds to any desired level of scrutiny, down to the level of the proof assistant’s fundamental axioms. POET won an ECRTS’22 Outstanding Paper Award.
4. Investigating the issue of uncertain execution times, the project team developed the first probabilistic response-time analysis based on Monte Carlo simulation. The main benefit of the proposed method is that it provides a controlled trade-off between analysis runtime, the desired degree of accuracy, and the permissible probability of a misestimate. While Monte Carlo simulation is a classic technique, it had not previously been applied and rigorously justified in this context. The project team’s work on Monte Carlo response-time analysis was recognized with the RTSS’21 Best Paper Award.
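To make the trade-off described in item 4 concrete, the following minimal Python sketch estimates a deadline-miss probability by Monte Carlo sampling. It is not the project team's analysis. The task model (one job released together with all higher-priority tasks, fixed-priority preemptive scheduling, positive execution times drawn from discrete distributions), the use of Hoeffding's inequality to choose the sample count, and all function names are assumptions made purely for illustration.

```python
import math
import random


def samples_needed(eps, delta):
    """Hoeffding bound: with this many independent samples, the estimate
    is within eps of the true probability with probability >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def draw(dist, rng):
    """Sample one execution time from a discrete distribution (values, weights)."""
    values, weights = dist
    return rng.choices(values, weights=weights)[0]


def sample_response_time(own_dist, higher_prio, deadline, rng):
    """Simulate one job released together with all higher-priority tasks.

    higher_prio is a list of (period, dist) pairs (hypothetical format).
    Returns the job's completion time, or a value greater than the
    deadline as soon as a deadline miss is certain.
    """
    t = draw(own_dist, rng)
    released = [0] * len(higher_prio)
    changed = True
    while changed and t <= deadline:
        changed = False
        for i, (period, dist) in enumerate(higher_prio):
            # Account for every higher-priority job released before time t.
            while released[i] * period < t and t <= deadline:
                t += draw(dist, rng)
                released[i] += 1
                changed = True
    return t


def estimate_miss_probability(own_dist, higher_prio, deadline, eps, delta, seed=0):
    """Return (estimated miss probability, number of samples used)."""
    rng = random.Random(seed)
    n = samples_needed(eps, delta)
    misses = sum(
        sample_response_time(own_dist, higher_prio, deadline, rng) > deadline
        for _ in range(n)
    )
    return misses / n, n


if __name__ == "__main__":
    # Illustrative workload: the numbers below are made up.
    own = ([2, 3], [0.9, 0.1])
    hp = [(5, ([1, 2], [0.8, 0.2])), (12, ([2, 4], [0.95, 0.05]))]
    p, n = estimate_miss_probability(own, hp, deadline=8, eps=0.01, delta=0.001)
    print(f"estimated miss probability {p:.4f} from {n} samples")
```

In this sketch, eps plays the role of the desired accuracy and delta the permissible probability of a misestimate. Tightening either one increases the number of samples and hence the analysis runtime.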