Division of Labor in Embedded Systems
You can choose among several strategies for partitioning an embedded application over incoherent processor cores. Here’s practical advice on the advantages and pitfalls of each.
Increasingly, embedded applications require more processing power than can be supplied by a single processor, even a heavily pipelined one that uses a high-performance architecture such as very long instruction word (VLIW) or superscalar. Simply driving up the clock is often prohibitive in the embedded world because higher clocks require proportionally more power, a commodity often scarce in embedded systems. Multiprocessing, where the application is run on two or more processors concurrently, is the natural route to ever more processor cycles within a fixed power budget.
Today these plural processors usually reside as cores in a single chip as part of a system-on-a-chip (SoC) solution, frequently together with control circuitry, local memory, and large chunks of dedicated non-programable custom logic, collectively called peripherals. In general the cores in multicore embedded designs are incoherent—i.e., there is no hardware means by which the cores maintain a consistent collective view of the rest of the system and in particular of the contents of memory.
The sources of incoherency are usually the caches, in which each core keeps a local copy of memory contents of interest. A value written by a core will usually not appear in memory shared among the cores until long after the write. In the meantime, other cores referencing the location will receive stale data. Incoherency is present even in write-through systems that write to shared memory at once, because another core may already have a cached copy of the prior (now stale) value, which it uses in post-write reads. Multiprocessing server and workstation systems go to great hardware lengths to preclude such incoherency, but these steps are almost never present in the embedded world.
In some multicore hardware designs the cores have differing instruction set architectures. Commonly, one core will be a conventional RISC, present to run legacy software or an operating system, whereas one or more others are digital signal processors (DSPs) to do the heavy lifting. Applications that need more processing capacity than is offered by a single core must be partitioned across the several cores. There are several possible applica-tion partitioning strategies. Presented in rough order of increasing implementation complexity and performance, these are:
- Independent applications. Each core runs an isolated application using its own statically dedicated set of system resources.
- Shared code applications. As above, but the independent applications share read-only resources such as memory occupied by code.
- Weakly coupled applications. As above, but the independent applications (and possibly hardware peripherals) communicate via a pool of buffers in producer/consumer fashion.
- Microtask lists. As above, but the application is broken into many independent applications that are dynamically scheduled onto the cores and run to completion. The microtasks, while running, are weakly coupled and do not share system resources concurrently.
- Symmetric multiprocessing (SMP). As above, except that a microtask potentially may not terminate, can be preempted, and may be resumed on a different core after preemption.
- Multithreading. As above, except that the microtasks potentially share system resources arbitrarily.
The following discussion refers to dual-core systems, but the strategies are applicable to any number of cores.
INDEPENDENT APPLICATIONS STRATEGY
The intended system may naturally present itself as a set of completely independent applications, not interacting in any way, which can be conveniently mapped one-to-one to the available cores. For example, one application may manage a video flow through video ports, while a second one may manage an unrelated audio flow through audio ports, where each flow fits in the processing capacity of a single core. Such a system could be implemented as two separate single-core chips, each with the appropriate peripheral ports, memory, and so on, but the system could also be combined onto a single dual-core SoC for economy.
In this strategy, the two applications are developed independently as separate programs and loaded independently to memory. If both applications use the services of a realtime operating system (RTOS), then each contains the RTOS code. Similarly, there are two copies (one in each linked application) of the C library or other code common to the applications. That is, common code is not shared. Boot configuration starts each core at the entry point of the corresponding application program.
System resources (such as peripherals or memory) are statically allocated to each application in this strategy. For example, DRAM address space is split into parts, one containing the code and data of the application running in one core, and the others containing the code and data running in the other cores. Likewise, the memory mapped input/output (MMIO) space of the peripherals, system control registers, and so on is apportioned among the two applications. Lastly, interrupt enables are configured so that peripheral interrupts are routed to the core that has been assigned the interrupting peripheral.
Depending on the configuration capabilities of the on-chip memory controller and buses, it may be possible to configure the chip so that the partitioning is not only of function but also of visibility. That is, it may be possible to configure the chip so that the resources dedicated to a core are not even visible or addressable by other cores. The implementation should use such a visibility control facility if it is present. Debugging a multicore application is much easier when you know that code in one core cannot erroneously overwrite data used by another.
Likewise, on some chips it is possible to bias addresses as seen or issued by a core so as to place the core address space at an arbitrary position (an aperture) in the SoC address space and, hence, physical memory. This is a limited form of virtual memory, in that the virtual address seen by the application is not the same as the physical address seen by memory, although there is typically no means to change the configuration at runtime and no support for demand paging.
If such a biasing facility is present in the hardware, then it is often convenient to have all the independent applications use a single virtual base address, although necessarily different biased physical addresses. This permits preexisting monocore applications to run in one core of a multicore SoC without recompilation or source changes to the hardwired address values that are often present in legacy embedded applications.
The major advantage of this independent applications strategy is its simplicity. Applications can be developed in isolation, debugged using only a single core or on a monocore chip of the same architecture, and then loaded to the multicore chip where they will execute in what the application sees as a monocore environment. A second advantage is improved robustness and reliability, with a consequential reduction of debug effort. If (as recommended) the chip is boot configured to isolate the resources used by each application, then an error in one application cannot overwrite data used by the other, or otherwise interfere with its operation.
This strategy has a number of drawbacks, however. The applications are unable to share resources such as a peripheral. Memory will be wasted because of duplicated code in the two applications. Perhaps most important, this strategy provides no way to balance loads across the two applications. This leads to a waste of processor cycles because a lightly loaded core cannot help with the processor peaks of the other application.
In practice this strategy may be dictated when the core set is heterogenous and each core uses a different instruction set. Then execution cannot migrate across cores, although the data sharing of strategies discussed later may still be possible. All subsequent strategies assume a core set that is at least partly homogenous.
SHARED CODE APPLICATIONS STRATEGY
The shared code applications strategy is identical to the independent applications strategy except that the memory used for instructions and constant data is shared across the two applications. There is no sharing of mutable data. To implement this strategy, the two applications may be developed in isolation, but must then be linked together and loaded to the chip as a unit. The apertures (assuming the hardware supports the concept) are set up so that one aperture holds the code and both cores are given access to that aperture. Peripheral MMIO space and read/write (data) memory are partitioned, as with the independent applications strategy.
The advantage of this strategy is the reduction of memory usage as a result of code sharing. So long as neither application contains self-modifying code, then the development process and benefits are essentially the same as for the independent application strategy. However, any code that is either self-modifying, dynamically generated, or in a disparate instruction set must exist in separate copies for each core. This requirement may force source code to be modified so that all such code is entered by a dynamic address (such as a function pointer variable) rather than by a static address set by the linker.
WEAKLY COUPLED APPLICATIONS STRATEGY
The weakly coupled applications strategy is similar to the independent applications strategy in that application functionality remains statically partitioned across the cores and no data is concurrently shared among the cores. However, data is communicated between the cores, usually in the form of Ping-Pong buffers. This data communication is essentially identical to that between a core and a peripheral in monocore stream processing systems; in the weakly coupled strategy each core essentially sees the other as a (smart) peripheral.
Typically, data enters the system through a hardware peripheral (such as a video input unit that digitizes an analog video signal) and is placed in a buffer. When the buffer fills, the peripheral switches to filling an alternate buffer and signals the transition by an interrupt. That interrupt is fielded by the core that has been statically dedicated to the next processing function for the input data, which then processes the buffer full of input data and places the result of the computation into an intermediate buffer different from those used for inputs between the peripheral and the core.
When the intermediate buffer holding the output of the first core fills, or some other processing condition occurs, the first core switches to an alternate buffer and in turn signals the transition by a (different) interrupt. That interrupt is fielded by the second core, which processes the buffer from the first core and places the results into an output buffer. When the output buffer fills (or another condition occurs), the second core switches to an alternate output buffer and uses MMIO to trigger an output peripheral (such as a video output unit that converts the modified digital video to analog form) to send the buffer full of output off the chip.
Thus, the whole application is statically partitioned into a bucket brigade of statically defined processing steps involving both peripherals and the cores. The steps communicate by buffers in memory, but at any time only one step is processing in any one buffer. That is, there is no concurrent access to any part of the data, and the system components communicate by message passing in the form of buffers.
This weakly coupled strategy will be familiar to users of monocore stream processors in that it is identical to the handling of data flow between peripherals and the core in a monocore. As such, it is subject to the same constraints, the most important of which is the need to manually control memory coherence in the presence of data caching in the core.
The data cache maintains a local copy of data, which has been read from memory. The cache may be a writeback cache, meaning that writes performed by the core are captured in the cache and are not immediately reflected into physical DRAM. Only when a modified (“dirty”) cache line is evicted from the cache does a write by the core become visible to other memory users.
The use of a writeback cache imposes a discipline requirement on applications that share or communicate memory data between a core and some other device such as a peripheral or a second core. In particular, the application must ensure that any writes by a core to data to be shared or communicated must be flushed from the cache before the consumer attempts to read the data. Without this discipline, the consumer may read the stale former value from memory instead of the updated value, which is still in the cache, and so produce erroneous results from stale data.
It is tempting to assume that this stale data problem cannot occur with write-through caches, where a write is propagated to memory at once instead of being held for later eviction from the cache. However, the application designer must still exercise care to ensure coherency even with write-through caches, because a write is not instantaneous. If a core writes and then immediately signals the consumer to begin processing the written data, it might be possible for the consumer to read the written location before the written data actually gets there. Whether such a race condition is possible depends on the details of the memory controller’s priority allocation algorithm. Bugs caused by such a race (if present) are unreproducible and difficult to find. The problem is best avoided by using a hardware special instruction (nearly always present or eas-ily simulated) that blocks notification to the consumer until all outstanding writes have completed.
A similar problem occurs with input buffers passed to a core. That is, a location in an input buffer may also be present in the cache as a result of earlier processing in the same buffer. If the core reads that location, it will receive the stale cache value instead of the memory value as updated by the producer of the buffer. To preclude using stale data, the application must be modified to add explicit cache control operations to discard the stale data and optionally to prefetch the updated data at that location.
Methods of ensuring coherence between memory buffers produced/consumed and the contents of cache include:
- Explicit flush on buffer produced/consumed
- Incremental flush
- Use of uncacheable memory for buffers
- Cache wash
Explicit flush on buffer produced/consumed. Applications using the weakly coupled strategy can add a step to their code to be executed each time an output buffer is to be passed to the next consumer in the array. This step must iterate through the address range of the buffer and use the processor’s cache control instructions to force any dirty lines in that range to be written back to memory. Only when all the writebacks have occurred is it safe to signal the next processing stage that its input is ready.
On input, the application must ensure that the cache does not hold a stale copy of the buffer data consumed. Consequently, it must step through the address range of an input buffer and use the processor’s cache control operations to discard any stale data and optionally prefetch the new data to cache. Note that prefetch will be ineffective if the buffer size prefetched is larger than the cache after giving effect to the working set of other memory usage and the cache replacement policy and wayness.
Incremental flush. Alternatively, if the output buffer is filled sequentially, then the cache flush could be inserted into the data output loop rather than forming a distinct loop at the end. Thus, as the application completes each cache line of output, it executes the cache control operation to flush that individual line to memory. While this does exactly the same number of flush operations as would a flush loop at the end, the flushing is inter-mixed with other processing and may provide better throughput as a result of execution overlap with other processing.
The same incremental approach can be used for input. The advantage is that it can be combined with prefetching of data for better performance, even when the buffer exceeds the size of the cache.
Use of uncacheable memory for buffers. The address range encompassing the buffer can be marked as uncacheable or allocated to an uncached aperture. This will cause any writes to the buffer to bypass the cache and be written directly to memory, and any reads to bypass the cache and be loaded directly from memory to the registers. Bypassing the cache, however, has its own costs in the form of more frequent memory transactions and stalls. The result may be lower performance than using a flush loop. Essentially, all processors with a cache provide some means for references to bypass the cache.
Cache wash. Under certain circumstances you may be able to omit explicit cache control operations. If the buffer is larger than the cache, and the data in the buffer is processed strictly sequentially, then for most cache architectures the normal operation of the cache will wash the buffer contents through the cache. This means that every line is written back (on output) or discarded and reread (on input) before being reused for a subsequent buffer full. In this situation, stale data on input is impossible, and stale data on output is impossible if the output consumer is two or more buffers behind the producer. Omitting cache control in this way can yield a large performance benefit. However, ensuring the preconditions can be algorithmically difficult (especially on output), and the resulting code will tend to break if ported to hardware with a different size cache or different hardware cache behavior.
The major advantage of the weakly coupled strategy is the conceptual simplicity and easy modularization of the application that it provides. The major drawback (besides the cache complications) is the static partitioning of the application workload over the cores. As with the previous strategies, it is impossible in the weakly coupled strategy for an overloaded core to dynamically offload processing to its more lightly loaded mate.
THE MICROTASK LISTS STRATEGY
The microtask lists strategy is similar to the weakly coupled strategy in that any given unit of processing is performed by only one core at any time, and the processing units communicate by passing buffers of data and do not share data. In this strategy, however, the processing units are not performed by statically dedicated hardware but instead are dynamically allocated to available cores for load balancing.
The overall application is broken down (statically) into a number of independent processing steps, each of which can be run to completion independent of all the rest. Each step is a microtask, and each microtask on activation receives a parameter record containing all the data it needs to complete the task, as well as an indication (typically a function pointer) of the code that performs it. That is, the task and its parameter buffer(s) can execute in complete isolation from the rest of the system.
The application creates a global work list, consisting of a queue of microtasks available to run. Each core independently acquires the task at the head of the queue and executes it to completion. It then fetches and executes the next task and so on. New microtasks are usually created and entered on the work list as a byproduct of execution, so that the application becomes a continuous time-ordered stream of microtasks that together implement the application. Which core executes which task is entirely happenstance, and (to the execution granularity of the tasks) the total application load will be automatically balanced over the cores.
Each microtask itself has characteristics similar to the statically allocated tasks of the weakly coupled strategy. That is, the parameter blocks used to communicate between microtasks are equivalent to (and often will contain the same data as) the buffers used to communicate among the cores and peripherals in the weakly coupled strategy. As such, they are subject to the same cache considerations as the buffers. In particular, a task that is creating a parameter block for a new task must be sure to flush the data from its cache in case the new task will be executed by the other core.
Similarly, the core starting work on a new task must clear any stale copy of the locations comprising the new parameter block from its cache before executing the task. However, an optimization is possible: If the parameter block is tagged with the identity of the core that created it, clearing the cache can be omitted if the creator is the same core as the executor. Note that the executor still must clear the line containing the creator tag in the parameter block; otherwise, the executor may use a stale creator indicator and think it was the creator when it was not.
In this strategy all application data is present in the parameter blocks, which are communicated but not shared. Unlike in the earlier strategies with their duplicated and disjointed data spaces, in the microtask lists strategy certain system data are shared across the microtasks. These include:
- The dispatch list
- The heap
All access to shared mutable data must be coordinated by a synchronization mechanism to preclude race conditions on update. Processor instruction sets are incredibly varied in the synchronization primitives they offer, from classic read-modify-write primitives, such as test-and-set or fetch-and-add, to single or multiple hardware synchronizers reached through MMIO registers. Fortunately, most implementations abstract this complexity and provide a higher-level mutual exclusion primitive in a support library; you don’t have to resort to machine-level operations to use them.
Sharing the dispatch list. Each core independently and asynchronously accesses the work list to add new microtasks and to remove the oldest one for execution. This access must be coordinated by a synchronization mechanism to preclude race conditions when updating the queue. Typically, an application using the microtask lists strategy statically dedicates one of the available mutexes to managing access to the worklist. In operation, a core wishing to access the worklist will first acquire the associated mutex, then do its list operation, and finally release the mutex.
Note that the list head and other associated data are shared among the cores, therefore causing the usual cache issues. While it is certainly possible to clear the cache location occupied by the list head before reading it, and to flush the location from the cache after writing it, doing this by hand is likely to be error prone. Instead, the application designer should consider placing the worklist data structure (but not necessarily the parameter block entries) in uncacheable memory. This obviates the need to manage cache by hand, and the extra performance overhead is likely to be trivial.
The designer might also consider placing the parameter blocks in uncacheable memory. Assuming that the bulk of the data is in separate buffers (as in the weakly coupled strategy) pointed to from the parameter blocks, the blocks themselves are likely to be quite small. Consequently, the gain in application simplicity by eliminating cache control for access to the blocks may well be worth the overhead of making these uncacheable. The main buffers remain cacheable and subject to necessary manual cache control as in the weakly coupled strategy.
Sharing the heap. Most C library heap implementations assume that the heap is global. Whether this needs to be true in a microtask lists strategy application depends on the details of the tasks. In general, if a heap allocation can survive beyond the execution of a single task (i.e., heap data is created and passed among microtasks), then the heap must be global. This means that access to heap operations must pass through mutual exclusion so that concurrent heap operations do not interfere with each other. Moreover, all heap data potentially passed between tasks is subject to manual cache control for both read and write. Lastly, all heap control structures must be in un-cacheable memory or access to the control structure must be carefully accompanied by explicit cache purge and/or flush operations. Note that most heap implementations maintain their data structures threaded in the space being managed, which typically must be cacheable.
Alternatively, if the lifetime of heap data is always less than any single microtask (i.e., each microtask always deletes any memory it allocates), then a different approach is possible. In this second approach the heap is statically split into two separate heaps, one dedicated to each core. The heap administration data is replicated, so that the two heaps are completely independent of each other. The correct administration is selected by the core number of the core performing the operation, typically determined by reading an MMIO location or from a special instruction. Because live locations are never passed across tasks (hence, across cores), using manual cache control on a read of heap data is not necessary. However, manual cache control is still necessary on write, or rather on deallocation; otherwise, delayed cache writeback from one core (to freed heap memory) could pollute the data used by the other core (after reallocation). This cache control can be put inside the library “free” function to ensure that it is always executed.
These complications suggest that the application designer using multiple cores would be well served by going to some lengths to remove all use of the heap from the application. Be aware that a number of the standard C library functions use the heap internally to create their return values, although alternative functions often exist that place their result in argument buffers instead of using the heap.
Sharing errno. Another source of potential data-sharing problems is the C language variable errno, which is normally global. This variable is set by essentially all system and library calls of the language. Consequently, errno access can lead to race conditions where a function in a task on one core reports an error and sets errno, but by the time errno is inspected in that task it has a new value set by the task in the other core. This event is unlikely, because each core is likely to maintain its own copy of errno in its own cache, but it is not impossible. The resulting software problems will be unreproducible and very difficult to find.
The errno problem has been recognized by multithreading packages on many systems, and the general solution is to change errno from a global to a per-thread datum. In the context of the microtask lists strategy, each task is equivalent to a thread, so one solution is to put an errno in each parameter block. The C library is then modified and the global declaration of errno is replaced by a macro that addresses a field in the parameter block. The C library must then be recompiled. Alternatively, errno can be made a per-core datum and either located in a dedicated register or as an array indexed by the core number; again, the libraries must be modified to remove the global declaration and a macro substituted with the desired addressing.
Both of these above solutions (per task or per core) presume that the lifetime of a particular value of errno is no more than the microtask that creates it. While somewhat at variance with the language standard, this constraint should not cause any application difficulties. Consequently, it is unnecessary to do manual cache control for errno using either of these approaches. Both the creation of a new value and any use thereof will take place in a single core under the microtask lists strategy, because tasks are run to completion in whichever core starts them. Note that simply retaining a single global errno but putting it in uncacheable memory is not correct; although this eliminates cache control problems, it leaves the race condition among independent accesses to errno in the different cores.
The main advantage of the microtask lists strategy is the possibility of dynamic load balancing across the cores. The main drawback is the algorithmic complications resulting from sharing data locations across cores. A second drawback is the necessity to manually partition the application into microtasks and manually set up and administer the parameter blocks and task activation.
SYMMETRIC MULTIPROCESSING (SMP)
The microtask lists strategy partitions the applications into units small enough that they can run to completion without starving other consumers of processor cycles. This model does not lend itself to applications that consist of a very large low-priority background task, together with a number of small high-priority foreground tasks. Of course, this poses no problem if the foreground tasks together will all fit into a single core while the background task runs in the other, but that situation is better modeled using the independent applications strategy where the background is one application and the collection of foregrounds is a second (multithreaded) application.
When the foregrounds together need more than one core, you must interrupt the background task on its core, run the foregrounds on both cores for awhile, and then resume the background when the foreground load drops. This situation is equivalent to conventional priority-based multiprocessing on a Unix desktop. A process is allocated to a core, which executes it either to completion, to expiration of its time slice, or to arrival of a ready process of higher priority (perhaps as a result of an interrupt). Thereupon the core is assigned to a different process. The situation is very similar to the microtask lists strategy (in particular, the processes do not share data), except that a process (task) can be interrupted arbitrarily during its execution and may be resumed on another core.
Implementing SMP requires a task switch that folds all the processor transient state down into the task parameter block so that it can be resumed later. This is unnecessary in the microtask lists strategy because all tasks are run to completion and no transient state exists. There are some important complications:
- Shared global structures
- Task switch and caching
- The stack
Shared global structures. All data structures shared in the microtask lists strategy are also shared in SMP, and the same implementation considerations apply. The process dispatch list must be protected by a mutex, and typically there are more points at which it is touched than in microtask lists. Either the heap administration is global (and access protected by a mutex) or there must be per-process heaps; per-core heaps are not possible because processes (users of the heap) migrate between cores. Similarly, errno can be implemented only as a per-process datum instead of per-core, or an inopportune switch could lose the errno value of a process.
Task switch and caching. Because an interrupted process may be resumed in a different processor, you must ensure proper cache control on task switch. The easiest way is probably to purge the entire data cache (writing back all dirty locations and forgetting all lines) on switch. This ensures proper switch semantics, although handling special cases such as switching into a process last run on the same core may provide significant performance improvements. Cache purge on switch can be expensive, so processor allocation granularity must be quite large. In particular, rewriting all interrupt handlers is usually necessary to remove implicit thread switching. A dedicated thread and core that fields all interrupts without switching will provide much better performance.
The stack. Much of the transient state of a process is located in the process stack, which is per process in SMP, and per core in microtask lists. This is no problem if the number of processes is statically fixed and their stacks allocated by the linker. If the number of processes is dynamic, however, then the allocator of stacks is itself a shared structure and must be protected by mutex and (typically) the administrative structures placed in uncacheable memory. The same holds for global structures of the threading package itself.
The advantage of SMP is the possibility of load sharing of application units that are too big to be run to completion. The drawback is the great increase of complexity required to ensure that concurrent access to globals and cache coherency problems do not cause inconsistent application behavior.
The multithreading strategy is the same as the SMP strategy, except that concurrent threads can share arbitrary application data with arbitrary timing of access. This strategy imposes all the well-known difficulties of any multithreaded application on any machine. In particular, all access to any shared memory object must be protected by suitable mutual exclusion primitives, and the data itself specified as volatile to the compiler.
Beyond the usual problems of multithreaded applications, multicore embedded hardware typically imposes some additional difficulties on a multithreaded strategy. Because the cores lack hardware cache concurrency, shared data must either be placed in uncacheable memory or all accesses must be bracketed by manual cache control to ensure a consistent view of the memory image. In addition, the cache must be purged on task switch, just like the SMP strategy.
The benefit of a multithreaded strategy is that it provides maximum usage of the available processing cycles for any application concurrency. The drawback (once the code works) is that task switch will be expensive, and so overall performance may be better using a less aggressive strategy such as the weakly coupled strategy.
SELECTING A STRATEGY
This article has provided a taxonomic classification of application partitioning strategies for use in incoherent multicore embedded applications. In practice, the independent applications strategy is by far the simplest and will have the best time to market, but may be too expensive in hardware. At the other end, a multithreading strategy is all but impossible to implement absent hardware cache coherence support. The strategies between these extremes vary in their difficulty and performance. Strategy selection is usually easy: For any given application, one or another of the strategies will leap up and provide an obvious mapping to what the application does.
Then the hardware budget is reduced, or the application is redefined, and the fun begins...
IVAN GODARD was formerly a senior architect at Trimedia, a developer of embedded processor cores. In various prior lives he has designed, implemented, or led the teams for three instruction set architectures, nine compilers, an operating system, an OODB, and a host of embedded and general applications. He is working to start up a fabricationless chip company based on a novel architecture that offers a factor of 10 or more performance gain over superscalar processors in general code.
Originally published in Queue vol. 1, no. 2—
see this item in the ACM Digital Library