High-powered internet applications typically need teams of experts to maintain them. Not any more, say European researchers who have built a system to create applications that manage and fix themselves.
Part of the internet’s potential lies in its ability to link hundreds, thousands, or even millions of devices.
Whether a user is downloading a video from a peer-to-peer service, performing scientific research on a grid, or using “cloud computing” to manage a business, programs that let many devices and applications work together are crucial.
The problem, says Peter Van Roy, coordinator of the EU-supported SELFMAN project, is that it’s getting harder to keep those systems working.
“The central challenge when you build big internet applications is how to keep them running without having to tweak and manage them all the time,” he says.
The SELFMAN team set out three years ago to solve that problem by finding out how to build programs that take care of themselves in the rough-and-tumble internet environment.
“We wanted to make big internet applications easy,” Van Roy says, “so that all the management problems you normally have are handled by the system itself.”
The payoff, he says, will be huge. “It will take the internet to the next level.”
Self-management – four key features
The SELFMAN researchers identified four vital functions for a distributed application to manage itself – self-configuring, -tuning, -healing and -protecting.
Software is continually being patched, updated or replaced. For a distributed system to configure itself, it needs to keep track of all its components, update them as needed, and make sure that all parts of the system can still talk to each other.
“Our system can ask a component, what version are you? Who are you talking to? It can then replace an old version with a new one as needed,” says Van Roy.
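As a rough illustration of the version check Van Roy describes, the sketch below models components that can report their version and peers, plus a routine that upgrades any that are out of date. The `Component` class and `upgrade_outdated` function are hypothetical names invented for this example and are not part of SELFMAN.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component that can answer 'what version are you?'"""
    name: str
    version: int
    peers: list = field(default_factory=list)

    def describe(self):
        # Answers both questions: current version and who it talks to.
        return {"version": self.version, "peers": list(self.peers)}


def upgrade_outdated(components, latest_version):
    """Bring every outdated component up to latest_version.

    Returns the names of the components that were upgraded.
    """
    upgraded = []
    for component in components:
        if component.describe()["version"] < latest_version:
            component.version = latest_version
            upgraded.append(component.name)
    return upgraded
```

Calling `upgrade_outdated` on a list that mixes old and current components changes only the old ones and leaves the rest untouched.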
Self-tuning means that the system can instantly adjust to changing loads and to components leaving or joining the network.
“Suppose one node is getting overloaded,” says Van Roy. “Our load-balancing algorithm allocates new nodes close to that hotspot. It spreads the heat to the other nodes and the hotspot cools down.”
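The hotspot idea can be sketched in a few lines. This is a deliberately simplified model, not SELFMAN's load-balancing algorithm: load is a single number per node, and the function name, the threshold, and the rule of moving half the load to the new node are all assumptions made for illustration.

```python
def relieve_hotspot(loads, threshold, new_node):
    """Return a new load map in which an overloaded node shares its work.

    loads     -- dict mapping node name to its current load
    threshold -- load above which a node counts as a hotspot (assumed rule)
    new_node  -- name of the node allocated next to the hotspot
    """
    busiest = max(loads, key=loads.get)
    if loads[busiest] <= threshold:
        return dict(loads)  # no hotspot, nothing to do

    moved = loads[busiest] // 2  # assumption: hand over half the load
    balanced = dict(loads)
    balanced[busiest] -= moved
    balanced[new_node] = moved
    return balanced
```

With loads of 10 and 90 and a threshold of 50, the busy node keeps 45 units and the newly allocated node takes the other 45.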
The internet is an unpredictable environment. Routers crash, cables get cut, parts of the system overload and grind to a stop, and components come and go.
“With SELFMAN,” Van Roy says, “each node stores some of the data and each piece of data is replicated a certain number of times. If a node crashes, the other nodes detect the crash, find a new node and give it the missing data. The system heals itself.”
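A minimal sketch of that replicate-and-repair cycle follows. It assumes a round-robin placement rule and a single spare node that is ready to step in; both are simplifications chosen for this example, and the function names are hypothetical rather than taken from SELFMAN.

```python
def place_replicas(items, nodes, copies):
    """Assign each item to `copies` distinct nodes, round-robin.

    Assumes copies <= len(nodes) so that the holders are distinct.
    """
    placement = {}
    for index, item in enumerate(items):
        placement[item] = {
            nodes[(index + offset) % len(nodes)] for offset in range(copies)
        }
    return placement


def heal(placement, crashed, spare):
    """Give the spare node every replica the crashed node was holding."""
    for holders in placement.values():
        if crashed in holders:
            holders.discard(crashed)
            holders.add(spare)
    return placement
```

After `heal` runs, every item is back to its full number of replicas and none of them sits on the crashed node.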
One of the biggest problems SELFMAN tackled was self-defence.
The researchers discovered that a system’s security depends on its topology – how nodes are linked to each other. They found that “small world” networks – in which most nodes are not directly linked, but in which any node can communicate with another in a few steps – were the safest.
“With a small world network, it’s easier to detect, isolate, and eject bad nodes,” says Van Roy. “The security service observes the system’s behaviour. If it notices that certain parts of the network are acting abnormally, it takes action.”
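The "few steps" property can be seen in a toy example. The sketch below compares a plain ring of nodes with the same ring plus a handful of long-range links, and measures the largest number of hops needed to reach any node. The topology and function names are assumptions made for illustration. This is not the network SELFMAN uses, and the code says nothing about security, only about path length.

```python
from collections import deque


def ring(n):
    """Each node is linked only to its two neighbours on a ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_shortcuts(graph):
    """Add one long-range link from each node to the node opposite it."""
    n = len(graph)
    for i in range(n):
        j = (i + n // 2) % n
        graph[i].add(j)
        graph[j].add(i)
    return graph


def max_hops(graph, start=0):
    """Breadth-first search: hops needed to reach the farthest node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())
```

On a 16-node ring the farthest node is 8 hops away. With the extra links it is 4, even though most pairs of nodes are still not directly connected.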
It’s all in the architecture
The SELFMAN team found that building these advanced capabilities into useful applications required a highly structured approach.
The foundation of each application is a structured overlay network. That’s a program – itself replicated across the network – that keeps track of all the nodes and connections between them, and can decide when and how to fix problems.
The next level is a replicated storage system. It makes sure that each node has access to the same data, and that data are always replicated to ensure they do not disappear.
The third level houses SELFMAN’s transactional problem-solver. It relies on a sophisticated algorithm called Paxos to provide a systematic way of reaching consensus among any number of fallible components.
Van Roy uses the analogy of a transfer between two bank accounts. “If you want to reduce one bank account by 100 euros and add that 100 to another, you want both or nothing,” he says. “Each node must see the same data.”
“Getting all this fluid behaviour – where even if nodes are crashing or new nodes are coming in or the network has problems it never blocks the system – was a big technical problem,” says Van Roy. “We needed Paxos to get it to work.”
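The "both or nothing" rule in the bank-account analogy can be sketched as follows. This is a single-machine illustration of an all-or-nothing update only. It is not Paxos and involves no consensus between nodes, and the `transfer` function is a hypothetical name introduced for this example.

```python
def transfer(accounts, source, target, amount):
    """Move `amount` from source to target, or change nothing at all.

    Returns (new_accounts, committed). The input dict is never modified,
    so an aborted transfer cannot leave a half-finished update behind.
    """
    valid = (
        source in accounts
        and target in accounts
        and amount > 0
        and accounts[source] >= amount
    )
    if not valid:
        return dict(accounts), False  # abort: neither account changes

    updated = dict(accounts)
    updated[source] -= amount
    updated[target] += amount
    return updated, True  # commit: both accounts change together
```

In a distributed setting the hard part is getting every replica to agree on whether the transfer committed or aborted, which is the job the article attributes to Paxos.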
The SELFMAN architecture and components have been used to build some impressive applications. These include a prize-winning distributed Wikipedia that can handle far more queries than the current version, a commercially successful media streaming service, and a graphics program that lets multiple users collaborate on a design.
Van Roy believes that SELFMAN opens the door to a host of high-powered, flexible, and “unbreakable” internet applications. “Right now we’re just scratching the surface,” he says.
Several SELFMAN-inspired applications will be highlighted in a second ICT Results story.
The SELFMAN project received funding from the ICT strand of the EU’s Sixth Framework Programme for research.
Media note: This feature can be republished without charge provided ICT Results is acknowledged as the source at the top or the bottom of the story. You must request permission before you use any of the photographs on the site. If you do republish, we would be grateful if you could link back to the ICT Results site (http://cordis.europa.eu/ictresults). Let us know if you republish so as to help us provide you with a better service. If you want further contact information on any of the projects cited in this story please contact us.