Programming without a Net
George V. Neville-Neil, Neville-Neil Consulting
Embedded systems programming presents special challenges to engineers unfamiliar with that environment. In some ways it is closer to working inside an operating system kernel than writing an application for use on the desktop. Here’s what to look out for.
What if your programs didn’t exit when they accidentally accessed a NULL pointer? What if all their global variables were seen by all the other applications in the system? Do you check how much memory your programs use? Unlike more traditional software platforms, embedded systems provide programmers with little protection against these and many other types of problems. This is not done capriciously, just to make working with them more difficult. Traditional software platforms, those that support a process model, exact a large price in terms of total system complexity, program response time, memory requirements, and execution speed.
The trade-offs between safety and responsiveness form a continuum [see Figure 1]. At one extreme is programming on bare hardware, without any supporting operating system or libraries. Here programmers have total control over every aspect of the system but must also provide all their own runtime checking for program errors.
The other extreme is application programming on a system such as Unix (including variants such as BSD, Mac OS X, Linux, HP-UX, and AIX) or Windows XP, where an operating system mediates application access to resources such as the CPU, memory, and secondary storage. Applications running on such a system are provided with the illusion that they are the only users of the system’s resources. This illusion is maintained because the operating system provides a container for the application called a process. Any system that provides this type of container is a Process Model Operating System (PMOS). PMOSs depend on memory protection to arbitrate access to physical memory, as well as on a virtual memory (VM) system that provides the illusion of an unbroken address space to each program. The memory protection mechanisms are often provided through the VM system, but the two concepts should remain distinct. It is certainly possible to provide memory protection without virtual memory and vice versa, but it is only in PMOSs that the two are conjoined.
Embedded systems occupy a space in the middle of the continuum. Most embedded systems are not written on bare metal. There is often an embedded operating system (EOS) that provides services such as a scheduler and a set of libraries to aid programmers in building their applications.
Two trends complicate this discussion. There are now several examples of “embedded” Linux or BSD. These are stripped-down versions of their larger cousins that have been adapted for embedded use. From the other end of the spectrum come traditional embedded systems that have had some form of memory protection added to them (for example, Wind River’s VxWorks AE). For the purposes of this discussion we are concerned with traditional embedded systems, those that come from the low-safety end of the spectrum and trace their lineage from realtime operating systems. Although some of the difficulties in working with traditional embedded operating systems present themselves when putting Linux or BSD into an embedded context, these OSs preserve a process model for the application programmer.
Programmers working with these middle-way systems should still concern themselves with how traditional embedded programming is done for two reasons: First, knowing how embedded programmers think about solving problems will provide insight into the design of embedded software that you will come across in the system. Second, even when you are working within a process model context, once the design goes embedded, the likelihood that you’ll have to work with some part of the system beneath the user/kernel boundary increases manyfold.
PMOS SAFETY DESCRIPTION
Even though PMOSs have been known and used for many years, it makes sense to explain the ways in which they protect programmers. In a PMOS the application cannot touch or otherwise modify physical memory directly. All memory accesses by the application are for virtual addresses, which the PMOS’s VM system translates into physical addresses. Several checks to catch program errors are made during the virtual-to-physical translation. For example, the referenced address must be known to be a valid one for the program, and it must not be 0 (NULL).
The process container has several sections: one each for the program instructions (text), initialized data, zeroed data for uninitialized variables (bss), a common memory pool (heap), and the stack. Each of these sections has different protections set on it, which also help in catching programming errors. For example, a program attempting to directly modify its program text will be terminated with an error because that area is marked as read-only. Global variables are kept in the initialized data and bss sections and are global only to the application itself; they do not pollute the space of other applications or the operating system kernel.
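As a rough illustration of where a C program's data ends up, the sketch below annotates each variable with the section it normally lands in on a typical toolchain. All of the names are invented for this example, and the exact placement depends on the compiler and linker configuration.

```c
#include <stdlib.h>

int boot_count = 3;                 /* initialized data section */
int error_count;                    /* bss: zero-filled before the program starts */
const char boot_banner[] = "ready"; /* read-only data, usually protected like text */

/* The function's machine code lives in the text section. */
int sections_demo(void)
{
    int on_stack = boot_count;               /* automatic variable: stack */
    int *on_heap = malloc(sizeof *on_heap);  /* dynamic allocation: heap */
    int result;

    if (on_heap == NULL)
        return -1;                           /* allocation can fail */

    *on_heap = on_stack + error_count;
    result = *on_heap;
    free(on_heap);
    return result;
}
```

On a PMOS each process gets its own private copy of these sections. In a single address space there is only one of each, shared by every task in the system.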
VM systems, when coupled to a secondary storage device such as a disk, provide another illusion to the programmer: that the application can use more memory than is physically present in the machine. Each application believes that its memory extends from an address just greater than 0 to the last byte that can be addressed by the CPU.
Embedded operating systems do not support a process model. They provide a single address space in which memory addresses map directly to physical locations in RAM and the operating system does little or nothing to protect programs from interfering with each other or with the OS itself. On CPUs that support hardware memory management some EOSs will protect the text pages of a program from modification, but this is a far cry from the protection a PMOS provides.
EOSs come from the low-safety, highly responsive end of the speed-vs.-safety continuum. They were originally designed as realtime systems that could react to external stimuli in a deterministic amount of time. To achieve deterministic response times, common operations cannot require a variable amount of time. Many VM operations, such as a virtual-to-physical translation or paging memory back from disk, are non-deterministic. Although many EOSs are now used outside of the hard realtime area, they retain their original design in which determinism was the most important goal.
Systems that support a process model are the direct descendants of timesharing systems. The goal of timesharing is to serve a large community of users and to share the computing resources as fairly as possible, while also making sure that users do not step on each other’s toes. The designers of these systems were willing to pay the costs associated with added complexity to ensure a high level of safety.
Timesharing systems are general platforms where arbitrary applications are run by a community of human users. In this type of environment it is rare for two applications to work together to complete a job. Although the output of one program might become the input to another, this is a limited form of cooperation.
In contrast to timesharing systems, embedded systems often contain several applications that are all supposed to work together to get a single job done (for example, a network router runs routing protocols, management applications, and performs packet forwarding). This expectation of cooperation combined with the need for deterministic response means that intertask communication mechanisms must have a very low overhead. Having all tasks execute in a single address space keeps communication overhead low. When necessary, tasks share large regions of data simply by passing pointers to each other. This is the most common form of intertask communication employed in embedded systems programming.
Programming on an embedded system has a lot of similarities to multithreaded programming. It shares all of its pitfalls and adds a few more. When a thread fails inside a program running on a PMOS, it is that single program that fails, not the entire system. On an embedded system with a single address space, a failing task can take the entire system down with it.
The dangers involved in writing code for embedded systems can be broken down into several general areas: Pointer Mistakes, Namespace Pollution, Inappropriate Sharing, Memory Bloat/Memory Requirements, Code Efficiency, and Task Scheduling.
Pointer Mistakes. The majority of programs for embedded systems are written in C, with a very small percentage in other languages, such as C++ and Java. Common programming errors in C include dereferencing a NULL pointer and dereferencing a garbage pointer.
In many embedded systems the address 0 is valid both for reading and writing. For example, it is common for devices to keep their memory in the first few pages of RAM, and they might store some data at address 0. Some processors keep their interrupt jump table in the first page of memory and place the system reset interrupt vector at address 0. This means that writing over a NULL pointer can cause errors that seem unrelated to any program in the system. The system may simply crash, perform a hard reset, or devices may start spewing garbage data.
Using a garbage pointer, one that is not 0 but is also not valid for any of the program’s data, can cause several different types of errors. Reading data from the wrong address will give a wrong answer to a calculation. Writing to a garbage pointer will not produce an error but may overwrite data that belongs to another task in the system. In the case where the EOS does not mark memory pages that contain program instructions as read-only, writing a garbage pointer can crash an unrelated program because its instructions were corrupted. Checking the program that crashed for a bug will not reveal anything, because it was not the cause of the problem; some other program was responsible.
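Because the hardware will not trap these mistakes, the checks have to be written into the code. The following is a minimal sketch, assuming a hypothetical receive buffer `rx_buf` and invented helper names: every write is validated against the known bounds of the buffer before it happens.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 64u

static uint8_t rx_buf[RX_BUF_SIZE];   /* hypothetical receive buffer */

/* Returns 1 if [p, p + len) lies wholly inside rx_buf, otherwise 0. */
static int rx_range_ok(const uint8_t *p, size_t len)
{
    uintptr_t start = (uintptr_t)rx_buf;
    uintptr_t addr = (uintptr_t)p;

    if (p == NULL || len > RX_BUF_SIZE)
        return 0;
    if (addr < start)
        return 0;
    return (addr - start) <= (RX_BUF_SIZE - len);
}

/* Copies len bytes into the buffer only if the destination is valid.
 * Returns 0 on success and -1 if either pointer is rejected. */
int rx_store(uint8_t *dst, const uint8_t *src, size_t len)
{
    if (src == NULL || !rx_range_ok(dst, len))
        return -1;          /* refuse rather than scribble on another task */
    memcpy(dst, src, len);
    return 0;
}
```

A check like this costs a few instructions per call, which is usually far cheaper than tracking down a crash in an unrelated task.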
Namespace Pollution. In a single address space all code and data share the same memory space, including the EOS itself. This means that all global data and nonstatic functions are visible to all the other code in the system. Poor choices for variables or function names will result in namespace collisions. This is not so much of a problem for functions because a linker will note a collision between two identically named functions. In a system with runtime, dynamic linking, this error is more serious because it prevents code from being loaded into the system. This is a catastrophic error when upgrading a system in the field because at this point fixing the problem may be impossible.
A more serious problem is duplicate global variables. These are not detected by the linker or the OS. In a single address space an extra reference to a global variable is just that, an extra reference. It is important to choose names carefully to avoid conflicts. Particularly large headaches come from choosing single letter names for global variables. Experienced embedded systems programmers will, if they must use global variables, place them in a structure pointed to by a single, descriptive global. The extra indirection in accessing the data is a small price to pay in exchange for not polluting the namespace and possibly causing unintended side effects later on.
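One way this convention might look in C is sketched below. The subsystem, the structure, and every name in it are hypothetical; the point is that only one descriptive data symbol is exported, and everything else is reached through it.

```c
#include <stdint.h>

/* All state for a hypothetical routing subsystem, gathered in one place. */
struct router_globals {
    uint32_t packets_forwarded;
    uint32_t route_count;
    int      debug_level;
};

/* static: the storage itself is invisible outside this file. */
static struct router_globals router_state;

/* The single, descriptively named global that other code may reference. */
struct router_globals *router_globals_ptr = &router_state;

/* Exported functions carry the subsystem prefix to avoid collisions. */
void router_forward_packet(void)
{
    router_globals_ptr->packets_forwarded++;
}
```

Compared with a handful of bare globals such as `n` or `count`, a collision now requires another module to choose the name `router_globals_ptr`, which is much less likely.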
Inappropriate Sharing. The ability to communicate quickly among tasks comes with a need to protect shared data effectively. The most common way to share information inside an embedded system is to pass the address to a chunk of memory between a set of cooperating tasks. For this to work, the tasks sharing the memory must agree on a protocol by which they access the shared memory.
Most EOSs provide basic mutual exclusion primitives, such as mutexes and semaphores, as part of the system. The key is to use what’s provided correctly. If one task causes a deadlock, all of the other tasks, and perhaps the entire system, cannot make progress and they fail. Recovering from this error at this point may require a system reset, or watchdogs may periodically check the health of certain subsystems and restart them when necessary. Obviously, avoiding the situation in the first place is preferable.
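A simple access protocol can be sketched as follows. POSIX threads mutexes are used here only as a stand-in for whatever primitive a particular EOS provides, and the structure and function names are assumptions made for the example. The rule is that the lock travels with the data and no task touches the fields without holding it.

```c
#include <pthread.h>
#include <stdint.h>

/* Shared region: the lock lives next to the data it protects. */
struct shared_stats {
    pthread_mutex_t lock;
    uint32_t rx_packets;
    uint32_t rx_bytes;
};

static struct shared_stats rx_stats = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Writer side: both counters change together or not at all. */
void stats_record(struct shared_stats *s, uint32_t bytes)
{
    pthread_mutex_lock(&s->lock);
    s->rx_packets += 1;
    s->rx_bytes += bytes;
    pthread_mutex_unlock(&s->lock);
}

/* Reader side: copy out a consistent pair, then release quickly. */
void stats_snapshot(struct shared_stats *s, uint32_t *packets, uint32_t *bytes)
{
    pthread_mutex_lock(&s->lock);
    *packets = s->rx_packets;
    *bytes = s->rx_bytes;
    pthread_mutex_unlock(&s->lock);
}
```

Tasks that receive a pointer to `rx_stats` would call only these two functions, which keeps the critical sections short and in one place.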
Memory Bloat/Memory Requirements. In a system with virtual memory, chunks of memory can continue to be allocated far beyond the physical memory of the system. Although this causes degradation of system performance, it rarely results in programs crashing. On an embedded system, once the physical memory is exhausted, all further attempts at memory allocation will fail. It is tempting to think that the economics of hardware have alleviated this problem with computers having up to 4GB of physical memory. In the embedded world the more common number is 16- to 32MB of RAM, which, though still generous, can easily be used up if programmers are not careful with their allocations. Without virtual memory backed by secondary storage, the EOS can’t do anything to help the application when physical memory is exhausted. All it can do is tell the program that the memory it wants is not available.
On an embedded system, memory bloat is a serious problem. Each allocation must live for only as long as necessary and should be freed as soon as possible. Defensive programming, in the form of allocating large chunks of memory just in case they are needed later, is not a good policy. Programs in this environment must use only the memory they need so that all the cooperating tasks can have enough space in which to do their work. Often, memory pools are associated with particular subsystems so that the problem can be contained, but this only ameliorates the problem and does not eliminate it. For those used to programming with the illusion of infinite memory, this kind of discipline can be difficult at first.
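A per-subsystem memory pool of the kind mentioned above can be as simple as a fixed array of equally sized blocks threaded onto a free list. The sketch below is one minimal way to do it; the sizes and names are arbitrary assumptions, and it is not thread safe, so a real system would wrap the calls in whatever locking the EOS provides.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  64u
#define POOL_BLOCK_COUNT 8u

/* A block is either on the free list or holds caller data, never both. */
union pool_block {
    union pool_block *next;
    uint8_t payload[POOL_BLOCK_SIZE];
};

struct block_pool {
    union pool_block blocks[POOL_BLOCK_COUNT];
    union pool_block *free_list;
    size_t in_use;
};

void pool_init(struct block_pool *p)
{
    p->free_list = NULL;
    p->in_use = 0;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        p->blocks[i].next = p->free_list;
        p->free_list = &p->blocks[i];
    }
}

/* Constant-time allocation; returns NULL when the pool is exhausted. */
void *pool_alloc(struct block_pool *p)
{
    union pool_block *b = p->free_list;

    if (b == NULL)
        return NULL;
    p->free_list = b->next;
    p->in_use++;
    return b->payload;
}

/* Constant-time release; mem must have come from pool_alloc on this pool. */
void pool_free(struct block_pool *p, void *mem)
{
    union pool_block *b = mem;

    if (b == NULL)
        return;
    b->next = p->free_list;
    p->free_list = b;
    p->in_use--;
}
```

Because the pool has a hard upper bound, a subsystem that leaks or hoards blocks runs out of its own memory first instead of starving the rest of the system, and the `in_use` counter gives a cheap way to watch for bloat.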
Code Efficiency. Industry pundits frequently cite Moore’s Law as evidence that code efficiency is not of paramount importance. What cannot be done by today’s processors will be easily achieved by tomorrow’s. Embedded systems have certainly benefited from this largesse just as the rest of the software world has, but this is not the whole picture. Embedded devices, unlike their larger desktop cousins, are often constrained by physical size, heat dissipation, and power requirements. Solving a problem by increasing the processor speed is frequently not possible. Imagine trying to cram a Pentium processor into a cellphone. What this means for programmers of embedded systems is that efficient code is still a top priority.
Task Scheduling. Scheduling tasks inside of an EOS also holds danger for embedded programmers. On a traditional OS platform, programmers never have to think about the scheduling of their programs. Even when writing multithreaded programs using a standard threads package, scheduling considerations are minor. This is because both traditional OSs and thread packages use round robin as their default scheduling policy. This scheduling policy makes sure that each program or thread gets its turn to run, as long as it is not blocked on some external event (I/O, waiting for a mutex, etc.).
EOSs, by default, use a preemptive priority scheduler. Each task in the system has a priority assigned to it, and the highest priority task runs until it blocks or completes. In this type of environment the work each task does, its importance, how often, and for how long it should run must all be well understood by all the programmers adding code to the system.
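The selection rule of a preemptive priority scheduler can be sketched in a few lines. This is an illustration only, with an invented task structure and the assumption that a larger number means a higher priority; a real EOS keeps ready queues and does far more work at each context switch.

```c
#include <stddef.h>

/* Hypothetical task descriptor; larger priority value means more urgent. */
struct task {
    const char *name;
    int priority;
    int ready;      /* nonzero if the task is not blocked */
};

/* Returns the highest-priority ready task, or NULL if every task is blocked.
 * Ties go to the task that appears first in the table. */
const struct task *pick_next(const struct task *tasks, size_t count)
{
    const struct task *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (!tasks[i].ready)
            continue;
        if (best == NULL || tasks[i].priority > best->priority)
            best = &tasks[i];
    }
    return best;
}
```

Unlike round robin, nothing here guarantees that a low-priority task ever runs: as long as a higher-priority task stays ready, it is chosen every time.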
One of the best known pitfalls is priority inversion. This occurs when two tasks are cooperating on some piece of work but the one with the higher priority is waiting on a resource or a piece of information held by the lower priority task. In the classic form, a task of intermediate priority preempts the lower priority task, so the higher priority task is held up indefinitely by work that is less important than its own. In the worst case, if the higher priority task does not give up the CPU while waiting, the lower priority task never runs, neither can make any progress, and they remain stuck until a human intervenes. Some EOSs provide special mutexes that detect this situation and temporarily allow the lower priority task to inherit the higher priority task’s priority level until it releases the mutex and the higher priority task can run again. Though this can be a lifesaver, it also costs time because the EOS has to do extra work every time this happens. It’s much better to get the scheduling of tasks right the first time.
Writing code for an embedded system is hard enough, but what about integration of third-party code? Your first brush with embedded systems is far more likely to involve integrating code than writing brand new code. Code integration comes in three flavors: integrating a third-party component meant for the EOS that you are working with, porting an application from a PMOS system, and porting code that was written for a PMOS kernel.
Integrating code for an EOS. When you execute a program on a PMOS, it runs in a container and doesn’t interfere with anything else running in the system. Installing the program might accidentally overwrite certain files or corrupt a registry key, but these errors are reasonably easy to protect against. Adding a new application to an embedded system is the same as linking a new chunk of code into a preexisting program. All the new code’s symbols must be unique and any external references must be resolved before the new code can become active. If the code shares any information with the EOS or other applications, it must use the same protection protocols—for example, the order in which it takes and releases locks—as any preexisting code in the system.
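The lock-ordering rule can be made concrete with a small sketch. The two tables, their locks, and the function names are hypothetical, and POSIX mutexes again stand in for the EOS's own primitives. Every code path, old or newly integrated, takes the route table lock before the interface table lock and releases them in the reverse order.

```c
#include <pthread.h>

/* Agreed order for the whole system: route table first, interface table second. */
static pthread_mutex_t route_table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t iface_table_lock = PTHREAD_MUTEX_INITIALIZER;

static int route_count;
static int iface_refs;

/* Preexisting code path: adds a route that references an interface. */
int add_route_via_iface(void)
{
    int total;

    pthread_mutex_lock(&route_table_lock);
    pthread_mutex_lock(&iface_table_lock);
    iface_refs++;
    total = ++route_count;
    pthread_mutex_unlock(&iface_table_lock);
    pthread_mutex_unlock(&route_table_lock);
    return total;
}

/* Newly integrated code path: mostly concerned with the interface table,
 * but it still takes the locks in the agreed order. */
int release_iface_route(void)
{
    int refs;

    pthread_mutex_lock(&route_table_lock);
    pthread_mutex_lock(&iface_table_lock);
    if (iface_refs > 0) {
        iface_refs--;
        route_count--;
    }
    refs = iface_refs;
    pthread_mutex_unlock(&iface_table_lock);
    pthread_mutex_unlock(&route_table_lock);
    return refs;
}
```

If the new code took `iface_table_lock` first, two tasks running these functions at the same time could each hold one lock while waiting forever for the other.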
The necessity of linking the new code directly into the system, whether dynamically or statically, requires a full retest and validation of the total system to verify that the new code did not cause unwanted side effects.
Porting code directly from a PMOS. This form of code integration is often seen as a quick way to get something up and running on an EOS, but this is often a mistaken idea. Software created for PMOSs was written with the assumption that it would be run on a system with a process model. These programs often contain tens, if not hundreds, of poorly named global variables and functions and have interfaces that they don’t expect to be exposed to others.
The first decision that has to be made is whether or not it will be more work to port the PMOS code or to write the same thing from scratch. The most difficult part of this decision is not coming to a technical understanding yourself, but convincing a manager that porting is not an effective solution. The fact that the code exists, and is available for use, is enough to convince most managers that it should be used.
Before committing to porting PMOS code to an EOS, you need to study several parameters of the code. These are all related to the issues presented in the previous section on writing code for embedded systems and involve the following: looking for and isolating global data, preventing namespace pollution, ferreting out memory bloat, and making sure that the program can be correctly scheduled with the other preexisting tasks in the system.
Porting code from a PMOS kernel. A common example is porting the BSD TCP/IP stack for use in an embedded system. This code may look self-contained and easily transportable, but that is not necessarily so.
Most PMOS kernels are monolithic, meaning that data structures in the kernel do not have to be protected from other parts of the system. When different parts of the kernel need to share data, they use simple code-locking techniques to make sure that critical information is updated correctly. Kernel code also expects to be called only through the system call interface. In an EOS the ported code could be called at any arbitrary point (that is, any place where a function is not declared static). The integrated code acts more as a library than as a piece of an OS kernel.
Traditional monolithic kernels typically do not support threads internally. This means that many services are not run as threads but are scheduled by a device interrupt or a timer going off. Having large chunks of code execute in interrupt context, as is done in a PMOS kernel, is going to destroy any determinism that was present in the original embedded system. If the processing is moved out of the device driver’s interrupt handler routine, then a task must be created to handle this work.
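One common way to split the work is to have the interrupt handler do nothing but queue the incoming data, leaving the protocol processing to a task. The sketch below shows a single-producer, single-consumer ring buffer for that purpose. The names are invented, it assumes a single-core CPU, and it omits both the memory barriers a multicore part would need and the semaphore or event an EOS would use to wake the task.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 16u   /* must be a power of two */

struct byte_ring {
    volatile uint32_t head;   /* advanced only by the interrupt handler */
    volatile uint32_t tail;   /* advanced only by the task */
    uint8_t data[RING_SIZE];
};

/* Interrupt context: constant time, no loops, no blocking.
 * Returns 0 on success or -1 if the ring is full and the byte was dropped. */
int ring_put_from_isr(struct byte_ring *r, uint8_t byte)
{
    uint32_t head = r->head;

    if (head - r->tail == RING_SIZE)
        return -1;
    r->data[head % RING_SIZE] = byte;
    r->head = head + 1;
    return 0;
}

/* Task context: drain up to max bytes; the slow processing happens here.
 * Returns the number of bytes copied into out. */
size_t ring_drain_in_task(struct byte_ring *r, uint8_t *out, size_t max)
{
    size_t n = 0;

    while (n < max && r->tail != r->head) {
        out[n++] = r->data[r->tail % RING_SIZE];
        r->tail = r->tail + 1;
    }
    return n;
}
```

The handler's cost is now fixed and small, while the task that calls `ring_drain_in_task` can be given whatever priority fits the rest of the system.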
Embedded systems programming presents a number of challenges to software engineers who have been reared on traditional PMOSs. Some of these challenges are a result of the history and requirements of embedded systems and some result from learned practices in writing applications on a PMOS. The intent here was to give you a sense of what writing code for an embedded system is like, so that with this information you can work within the embedded world, as well as in the non-embedded one.
Originally published in Queue vol. 1, no. 2—
GEORGE V. NEVILLE-NEIL, email@example.com, works on networking and operating system code for fun and profit, and also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and networking. He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts. He is a member of the ACM, the USENIX Association, and the IEEE. He is an avid bicyclist and traveler who currently resides in New York City.
© 2012 ACM 1542-7730/03/0200 $5.00