A Conversation with Mike Deliman
And you think your operating system needs to be reliable.
Mike Deliman was pretty busy last January when the Mars rover Spirit developed memory and communications problems shortly after landing on the Red Planet. He is a member of the team at Wind River Systems who created the operating system at the heart of the Mars rovers, and he was among those working nearly around the clock to discover and solve the problem that had mysteriously halted the mission on Mars.
Deliman serves as chief engineer of operating systems at Wind River Systems. After leaving the University of California at Santa Cruz, where he majored in computer and information sciences, he went to work for a Unix company and was introduced to VxWorks, Wind River’s realtime operating system, later adapted for use in the Mars rovers. “I was very impressed with that very early version of VxWorks,” he says. “As fate would have it,” he adds, “just a few years after starting that job, the company closed its San Jose offices, and I moved to Wind River.” He has worked with NASA’s Jet Propulsion Laboratory on various space projects ever since.
Discussing the role of software in space with Deliman is George Neville-Neil, who is also well acquainted with VxWorks. He developed a device-driver model for networking devices used in VxWorks, worked on a multi-instance version of the Berkeley TCP/IP stack, and ported open source networking code to VxWorks. He has worked in the embedded systems area for the past eight years, both as an integrator of final products and as an implementer of off-the-shelf embedded operating systems. His work has centered on the networking aspects of embedded systems, but he has also done general work on the broader aspects of the systems. Neville-Neil is currently working on a new, commercial, dynamic host configuration protocol (DHCP) server at Nominum. He also teaches seminars and classes.
GEORGE NEVILLE-NEIL How did you wind up working with NASA on its space projects?
MIKE DELIMAN In 1994 Wind River Systems was asked to port its operating system to a radiation-hardened processor based on the IBM Power chip, the 32-bit predecessor to the current PowerPC line. The Power chip was also called the RS6000; the rad-hard version was called Rad6000. I was lucky enough to be asked to help with the Wind River end of the software and became an expert with both the chip and the VxWorks port. Everyone else who had worked with it moved on, but I kept helping the NASA folks use the software in other space-based applications—DS1 (Deep Space 1), SeaWinds, SMEX (Small Explorer Project)-Lite, Genesis, Stardust, SORCE (Solar Radiation and Climate Experiment), Gravity Probe B, and several other deep space probes and satellites.
When the Mars Exploration Rover (MER) project started, I was called and asked who was left from the Pathfinder project that could work on MER. I was it.
GNN How does one “radiation-harden” a processor?
MD Radiation in space takes the form of high-energy particles—protons, electrons, etc.—moving at very high rates of speed, and thus carrying a lot of energy. When these subatomic particles hit something made of metal, they can induce transient charges on the metal. When they hit silicon, they are capable of “burning” holes right into the silicon.
To radiation-harden a processor, you sort of engineer the chip backward. Every year there’s a big push to squeeze more transistors into less silicon, and use smaller and smaller gold or copper vias (wires) over the troughs in the silicon that make up the transistors. The smaller these features are, the more susceptible to voltage transients they become. To make the chips more resilient to power surges that can be caused by protons or electrons, you make the troughs deeper and wider, and use bigger “wires.” To help protect the silicon from being burned through, the chips are encased in different kinds of ceramic shells that are thicker than they normally would be. A side effect of the bigger features inside a radiation-hardened chip is that it takes more electrical charge to operate normally, and/or its clock rate (the speed at which it runs) must be turned down to allow the necessary charges to build up.
GNN What is your role working with NASA?
MD For the MER project, 2001 through February 2004, I was the chief engineer of the operating system. I did extensions, modifications, bug fixes, investigations, and porting work (new compiler tools)—pretty much everything for the Wind River side of the project.
I’ve since left Wind River Systems and now work as a full-time employee at NASA’s Jet Propulsion Laboratory (JPL).
GNN Can you tell us a bit about your work at Wind River? How many other people at Wind River work on the software for NASA/JPL and how does that relationship work?
MD I was the only one at Wind River working on the Rad6000 software. I consulted other engineers for specific issues, but I was the only engineer responsible for the Rad6000 processor support on the Wind River side.
GNN What was your role during different phases of the mission (launch, transit, planetfall, etc.)?
MD In all phases I was the chief engineer—I acted as engineer, consultant, and the only technical support contact. This included responding while on vacation, even in remote areas (I had a laptop and a cellphone, and took them everywhere). The only difference is that while the mission was on the ground, I had a little more time to respond; once it was in flight, any problem encountered inherited new urgency.
GNN Your primary focus for the MER project was the porting of VxWorks to the Rad6000, right? Did you also work on applications for MER? What were the typical support issues you ran into? Can you give us an example of a call you might have received from NASA at this phase?
MD My primary role was to update and maintain the software and extend it as needed by the MER team. In January, when the Spirit rover suffered from the file-systems anomaly, I was called in to help diagnose the problem, and as we understood more about it, to help characterize the exact nature and extent of the problem. In everyday terms, we had two sets of “buckets” to put data into: a very big bucket for long-term holding (a bank of flash memory), and a set of smaller buckets for temporary holding (blocks of RAM used to “cache” data until it could be moved to the flash bank). The software that managed the set of smaller buckets was allowed to ask for more buckets as needed; eventually, the system just ran out of space to make more small buckets, and the whole process of managing buckets was shut down. This precipitated other problems, which led to the cycle of rebooting.
GNN What makes writing code for spacecraft hard?
MD Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet: you can’t put your hands on the malfunctioning system to see what’s going on; you must use intuition and experience.
GNN How do you debug problems on the ground versus in space?
MD On the ground, you use all the tools you can put your hands on, including software debugging tools (WindView, shared memory dumps, software source-level debuggers, etc.). From off-planet, you mostly think about the problem and run tests with what you have in the lab to see if you can re-create the symptoms.
GNN Can you give us an example of a problem you had to debug for MER? How would you fix a problem—upload a patch, or a whole new version of everything (operating system, apps)?
MD In the case of the Spirit rover file-systems problem (bucket managing), the team at JPL realized this might be a possible problem. They did their best to address it, by sending up routines designed to clean out older files, freeing up space both in the long-term storage and in the set of smaller buckets. The day after sending up those routines, the team found out that not all of the routines had made it to the rover intact. The routines were to be sent up to the rover again on “Sol 19” (the 19th day of operation on Mars). Unfortunately, on Sol 18, the problem occurred.
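[Note: the “bucket” behavior described in this conversation can be illustrated with a toy model. The sketch below is not the flight software; the class name, the function names, and the block counts are hypothetical, and the `max_buckets` cap is an assumption added for contrast rather than the remedy the team used. It only shows why a pool that may always ask for more RAM eventually fails, while one that is refused and cleans out old entries keeps running.]

```python
class CachePool:
    """Toy model of RAM 'buckets' that stage data before it moves to flash."""

    def __init__(self, ram_blocks, max_buckets=None):
        self.free_ram = ram_blocks      # RAM blocks still available to the pool
        self.max_buckets = max_buckets  # None models "may always ask for more"
        self.buckets = 0                # buckets currently allocated

    def add_file(self):
        """Allocate one bucket for a new file; return False if the cap refuses it."""
        if self.max_buckets is not None and self.buckets >= self.max_buckets:
            return False                # bounded pool: caller must clean up first
        if self.free_ram == 0:
            raise MemoryError("no RAM left for another bucket")
        self.free_ram -= 1
        self.buckets += 1
        return True

    def delete_oldest(self):
        """Model a cleanup routine that frees one bucket."""
        if self.buckets:
            self.buckets -= 1
            self.free_ram += 1


def run(pool, files):
    """Store `files` files, cleaning up whenever the pool refuses a request."""
    try:
        for _ in range(files):
            if not pool.add_file():
                pool.delete_oldest()
                pool.add_file()
    except MemoryError:
        return "out of memory"
    return "ok"


# Hypothetical sizes: 8 RAM blocks, 20 files to store.
unbounded = run(CachePool(ram_blocks=8), files=20)
bounded = run(CachePool(ram_blocks=8, max_buckets=6), files=20)
print(unbounded, "/", bounded)
```

[In this model the unbounded pool fails on the ninth file, because nothing ever tells it to stop asking for RAM; the capped pool is refused early and recovers by deleting its oldest entry.]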
What we did as a team was first to diagnose and characterize the problem as completely as possible, then test ways to detect and prevent it. We realized that some of the testing in the lab wasn’t exactly the same as what was happening on Mars. The team made a combination of changes based on the differences in environments and the work we did to prevent the problem from occurring again. These changes were tested in the lab, verified to provide relief from the problem, and then sent to both rovers (Spirit and Opportunity). The fix mostly affected applications. It should be noted that part of the problem was a configuration issue: the system is extremely complex and there are numerous items that can be configured to react in specific ways. All of the configured items worked exactly as they had been configured to.
GNN How do differences in space-based hardware affect what the software sees or does?
MD In the best theory, you will test what you write, and fly what you test. In reality, sometimes that may not be possible. Having said that, if you have two pieces of hardware that are identical except that one is hardened for space flight, the software should run identically on both pieces of hardware.
GNN Is hardening a spacecraft the same thing as hardening a processor, or are there additional steps?
MD Hardening a spacecraft is much easier than hardening a processor. The steps to harden a processor can take years, and it requires testing and reworking, iterated several times to create a processor that is both radiation-hardened and functional. This is a costly and time-consuming process. These are the reasons why rad-hard processors are so far behind consumer-grade processors. For instance, the current state-of-the-art rad-hard processor is a PPC750-based chip that runs at 130 megahertz, whereas you can buy consumer-grade PPC750s that run at well over 1,000 megahertz.
To harden a spacecraft, you just need to add more layers of “stuff” that makes it harder for the protons (etc.) to get into the “guts” of the craft. The problem with this approach is that each layer adds weight to the craft; more weight means you need more thrust to get it into orbit. More thrust means stronger rockets and more propellant, which in turn adds more weight. The cost goes up dramatically as weight is added.
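[Note: the weight penalty described here can be made concrete with the ideal (Tsiolkovsky) rocket equation, delta-v = v_e * ln(m0 / m1). The sketch below is a rough single-stage illustration, not a mission calculation; the delta-v and exhaust-velocity figures are assumed round numbers, and the function name is hypothetical.]

```python
import math


def propellant_mass(final_mass_kg, delta_v_m_s, exhaust_velocity_m_s):
    """Propellant needed by the ideal rocket equation (single stage, no losses).

    From delta_v = v_e * ln(m0 / m1) with m0 = m1 + propellant:
        propellant = m1 * (exp(delta_v / v_e) - 1)
    """
    mass_ratio = math.exp(delta_v_m_s / exhaust_velocity_m_s)
    return final_mass_kg * (mass_ratio - 1.0)


# Assumed round numbers, for illustration only.
DELTA_V = 9400.0    # m/s, roughly what reaching low Earth orbit costs
V_EXHAUST = 4400.0  # m/s, roughly a hydrogen/oxygen engine in vacuum

base = propellant_mass(1000.0, DELTA_V, V_EXHAUST)
shielded = propellant_mass(1100.0, DELTA_V, V_EXHAUST)  # 100 kg of added shielding
extra_per_kg = (shielded - base) / 100.0
print(f"extra propellant per kg of shielding: {extra_per_kg:.1f} kg")
```

[Under these assumptions every added kilogram of shielding costs several kilograms of propellant, before counting the larger tanks and structure needed to hold it.]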
GNN What kinds of applications are placed on top of the operating system in a spacecraft?
MD In the case of the rovers, the applications were of three different natures. The first set of applications was designed to get the craft off of Earth and out to Mars; the second set to get the craft out of space and landed on Mars safely; the third set was how to be a robot geologist and accomplish the main goals of the project: looking for signs of water.
GNN How is application programming done for a spacecraft?
MD Much the same as for anything else—software requirements are written, with specifications and test plans, then the software is written and tested, problems are fixed, and eventually it’s sent off to do its job. In the case of satellites, you want to be extra cautious about designing and implementing software, and diligent with testing the software, to make it as robust as possible before launch. It is almost impossible to schedule an on-site visit once the craft is on its way.
GNN Were there any specific architectural challenges or difficult requirements to meet when programming the operating system or apps for MER? Are there unique techniques that are necessitated by the fact the platform is a spacecraft—for instance, techniques that will help with remote debugging?
MD The processor I had was able to run at two speeds; I took it upon myself to make sure that all the tests behaved as expected at both speeds. I also adapted and ran sets of tests that corresponded to software updates I applied to the operating system and kernel, and additionally ran benchmark tests to help characterize how the processor would run under processing loads.
JPL invented methods to acquire debugging information from the system when Mars Pathfinder was engineered. It did this by exploiting the flexibility of the operating system and knowledge of how the system works (the application binary interface for the processor). To do this, JPL had to have access to all of the source code and the ability to rebuild from the source code. This is a crucial feature: it gives the engineering team the ability to research the deepest reaches of the code. In general, the software packages delivered by Wind River were not designed to be handled in that way; I went out of my way to make the package delivered to MER complete enough so it could be used as needed, including rebuilding the entire package on-site if necessary.
GNN How much code is written at JPL versus code written at Wind River?
MD The operating system and kernel fit in less than 2 megabytes; the rest of the code, plus data space, eventually exceeded 30 megabytes.
GNN What kinds of software tools are used to write code for a spacecraft?
MD The most common “tools” are the engineers, I suppose. Each engineer has a preferred platform-host type (Sun or PC), editors (vi, Emacs, homemade, etc.), and methodology.
GNN How is QA (quality assurance) on the code that is sent to NASA different from what other customers get?
MD There are standard test suites run against all ports of VxWorks. In the case of the space software, I ran and supplied additional tests for each set of bugs fixed or extensions added. I would also respond to any anomaly report by trying to re-create the conditions seen by the customer—and use that information either to explain the results or to identify and fix problems as soon as I possibly could.
GNN Does NASA retest everything it receives from a vendor?
MD JPL, in this case, tested the parts it used, and verified results. Other customers do similar things, sometimes verifying results by testing the same routines on other operating systems or processor platforms.
GNN How is the operating system provided to NASA different from what other Wind River customers receive?
MD In the case of MER, it is unique. There were many fixes and enhancements made for the Spitzer Space Telescope (also known as SIRTF—the Space Infrared Telescope Facility); those enhancements were implemented to make as little change to the operating system as possible. Changes included improved mathematical precision for some trigonometric routines, improved handling of the I/O system, and several bug fixes. All of those enhancements were brought into the MER code-line. The MER code was then ported to be built with the latest set of tools (including an updated C++ front end), retested, and delivered to the MER team. [Please see http://www.spitzer.caltech.edu/ for more information on Spitzer.]
The code for MER was also made to be buildable on-site, without the Wind River infrastructure. This is very much a nonstandard practice.
GNN Are there security concerns with space-based software? Has anyone tried to hack a space mission?
MD The problem is that the only way to communicate with the spacecraft is through the antennas of NASA’s Deep Space Network. It would be hard to build an antenna the size of a football stadium in your back yard and not be noticed.
GNN What about another country taking over someone’s craft. Lots of countries—some friendly, perhaps some less so—have big telescopes. Do people worry about a rogue state hijacking probes?
MD As far as I know, several countries have “spied” on space missions over time—listening to signals, decoding transmissions, and perhaps even “jamming” signals. This must have been particularly notable during the Cold War. This is not a military mission, however, and the results will be publicly available—in most cases very soon after the data is brought back to Earth. The only reason to “hack” into the craft would be out of malice. It would still require vast resources to achieve the goal of hacking into either rover, and those resources would be very obvious. A number of protocols are involved: radio frequencies, rates at which data must be transmitted/received, and other factors that would make it nearly impossible to “hijack” a space probe. Even if you had all of the information necessary to do this, you would still have to be able to point your earth-bound antenna accurately enough to get your transmissions to the intended target.
These days, the public regards space missions almost as casual events. In reality, they are far from trivial, and even the smallest launch requires the teamwork of hundreds, if not thousands, of individuals making their parts work as flawlessly as possible. I am completely amazed by the dedication and professionalism exhibited by every member of the MER team, whether they were at JPL, Cornell, or even the shops making ball bearings or sewing airbags. Without that level of commitment, the mission would not be as successful as it is. I think it would be safe to say that all space-borne projects are as intense. Every space-borne project I’ve been involved with has had similarly dedicated staffs who overcome seemingly impossible problems to reach the mission goals, even the ones that didn’t work out well.
GNN What special tricks are there to handle problems in software that is several light-minutes from your desktop?
MD It’s mostly intuition and experience: “Here’s what we know, here are the symptoms, here’s a set of conjectures about what could be happening.” When given this kind of problem, you eliminate the improbable and work forward, trying to test scenarios to see which make sense and which don’t.
GNN What scary problems have been caught on the ground, before the mission went into space?
MD There were many problems found on the ground, even at Wind River. I wouldn’t call any of them “scary,” though some were nontrivial.
Many years ago, while Stardust was still on the ground, a problem was found with the compiler tools. It was mishandling one of the registers, overwriting a value before storing what had been in the register. This had the potential to make the entire system into a really expensive random number generator—not what you want from your spacecraft. The tools were fixed, the entire set of releases currently in use was rebuilt with the fixed tools, and updates were sent to all the customers using the software at the time.
GNN What is the hardest problem you’ve had to debug during a mission?
MD Two problems: the Mars Pathfinder priority inversion problem and the MER file-systems anomaly. Both were solved by some of the most brilliant engineers I’ve had the pleasure to work with, working as teams. I think Glenn Reeves [Mars Pathfinder Flight Software Cognizant Engineer] does the best job of explaining what went wrong and how it was fixed [see http://research.microsoft.com/~mbj/Mars_Pathfinder/Authoritative_Account.html]. [Note: the Mike most often referred to in this online document is Mike Jones of Microsoft; Mike Deliman is from Wind River.]
GNN Obviously there was a huge news frenzy as one of the rovers became inoperable for a couple of weeks. What went wrong—and why did it take so long to fix? Was it one of those situations where once you figured out what the problem was, it was easy to fix? Or was it just a real difficult thing to fix, requiring lots of work?
MD It certainly wasn’t an easy problem to diagnose or rectify.
There were many aspects of diagnosing and addressing the problem with the Spirit rover, which occurred in mid-January. There were many possible problems: it could have been a power surge, radiation from space, an intermittent wiring short, thermal-related problems, mechanical failures precipitated from launch or landing, etc. We had the task before us of eliminating the least likely, characterizing the most likely, and paring down the list of possibilities into a manageable set of probable causes, and then exploring those causes. Once the problem was accurately diagnosed and characterized, it had to be simulated in the lab, and the remedy still had to be implemented and tested.
Though I wasn’t directly responsible for implementing or testing the remedy, I did assist the team as much as I possibly could. For me, the call to help came literally 20 minutes after Opportunity had landed on Mars. It required research into the source code, discussions with experts in three time zones: Japan, California, and Gusev (that’s where Spirit landed on Mars), and taking copies of the work wherever I went so I could access it. There were many days of long hours, working late into the night, creating and running tests. I worked through weekends, woke up three times a day to make the contacts I needed to make, took breaks only for meals, sleep, showers, and enough time to care for my dogs.
I know the rest of the team was just as dedicated and focused, and put in at least as much effort. We all had to do our best to handle the situation and juggle our own family requirements and personal difficulties. Throughout the effort, I had numerous distractions (other projects approaching deadlines, requests from the media, explanations and status updates to be sent to the management teams and executives, and a death in the family). I don’t expect any of the team had an easy time handling their parts of the mission. I am extremely proud of our achievement and very thankful to have received so much support from friends and coworkers who made it possible for me to make my contributions.
GNN Has the design of a spacecraft ever affected the operating system software? Do things you learn working with NASA/JPL wind up in the base operating system code?
MD Some of the things we fixed for various space missions did get folded back into the base package. Of note, with several support engineers at Wind River, we fixed some of the math routines for a space customer; the resulting routines were every bit as accurate as the IEEE 754 versions, and had better timing characteristics.
GNN Do you think NASA/JPL would switch to an open source operating system for a space mission?
MD I would not rule it out, but the fact is, when you’re dealing with a billion dollars worth of hardware and many man-years worth of effort, you tend to go with what you know will work. As an example, let’s look at the Rad6000 processor. It’s a 32-bit computer that runs at 20 megahertz and was the pinnacle of technology perhaps in 1990. It has a limited amount of RAM, and that RAM is pretty slow by today’s standards. It’s at best a relic compared with today’s processors that run hundreds of times faster.
Even though it’s an old design, the Rad6000 is very well understood, has been used in many successful space missions, and is still in use in several others. This history of success builds up a good reputation, which in turn translates into confidence. If you’re confident in the basic platform at the heart of your satellite, you can feel more confident about the satellite surviving to achieve its goals.
© 2004 ACM 1542-7730/04/1000 $5.00
Originally published in Queue vol. 2, no. 7.
- George V. Neville-Neil works on networking and operating system code for fun and profit, and also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and networking. He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts. He is a member of the ACM, the USENIX Association, and the IEEE. He is an avid bicyclist and traveler who currently resides in New York City.