
Quality Assurance


Orchestrating an Automated Test Lab

Composing a score can help us manage the complexity of testing distributed apps.


Networking and the Internet are encouraging increasing levels of interaction and collaboration between people and their software. Whether users are playing games or composing legal documents, their applications need to manage the complex interleaving of actions from multiple machines over potentially unreliable connections. As an example, Silicon Chalk is a distributed application designed to enhance the in-class experience of instructors and students. Its distributed nature requires that we test with multiple machines. Manual testing is too tedious, expensive, and inconsistent to be effective. While automating our testing, however, we have found it very labor intensive to maintain a set of scripts describing each machine’s portion of a given test. Maintainability suffers because the test description is spread over several files.

Experience has convinced me that testing distributed software requires an automated test lab, where each test description specifies behavior over multiple test-lab components.

In this article, I address the need for a centralized test description, called a score. I describe the syntax of a score as a means of discussing the issues I have encountered and some methods for addressing them. These ideas are based on our efforts at Silicon Chalk to centralize the automation of as much of our testing as possible. Although other applications may require different test-lab infrastructures, I outline what I believe to be the core issues.


At Silicon Chalk we need a large test lab. This is only partly because we need to test our application with a variety of hardware. Our software is specifically designed for in-class use, where many students interact with an instructor. Classes may contain 20 to 200 laptops connected through a wired or wireless network. Silicon Chalk supports collaboration, communication, exercises, note taking, and presentation in face-to-face classes where some or all students have laptops, desktops, or tablet computers. There are many ways for actions on one participant’s machine to have effects on all the others. The collective dynamic behavior of multiple instances of Silicon Chalk, all communicating with each other, requires that it be tested in a multiple-instance environment, rather than testing individual instances.

This is a fundamental difference between distributed applications and more traditional solitary applications such as Excel. For example, two users working with Excel simultaneously need not know of the other’s existence. The only impact they may have on one another is through contention for an external resource (e.g., a spreadsheet). Such contention can usually be resolved by arbitrarily assigning the resource until it is free to be passed on. In contrast, multiple instances of Silicon Chalk create a congregation where each instance has a vested interest in putting information on the network with coordinated care.

In a Silicon Chalk session, information predominantly flows from instructor to students. For example, the instructor presents a set of slides, and they appear, one by one, on the students’ displays. At other times during the session, information may flow from the students back to the instructor (e.g., the students’ answers to a quiz). So at any given time, there may be many machines trying to put data onto the network.

In particular, a wireless network is quite fragile and can start to drop data as it nears capacity. As the members of the congregation send data on the network, they must pay attention to how much data is being lost. Note that Silicon Chalk is an example of a distributed application where some resources (e.g., the wireless network) must be managed as a congregation rather than individually.

One of the most important types of tests we perform is the Group Test, which simulates a typical session on a collection of many machines usually connected through a wireless network. We simulate various session activities and measure network utilization to verify performance. We conduct these tests on different numbers of machines and network configurations (e.g., 802.11b only, b+g, g only, single access points, multiple access points, etc.).

The variety of wireless network equipment may or may not have an impact on the network’s performance (from Silicon Chalk’s point of view). Different network hardware has the potential to deliver data at differing rates. Thus, different instances will have different data transfer characteristics, which results in different behavior. These interactions must be investigated to ensure that a campus with a diverse range of student laptops will be able to conduct satisfactory Silicon Chalk sessions.

Because Silicon Chalk is an interaction of many machines in parallel, testing becomes complex. Each configuration needs its chance as instructor and student. The interaction of Silicon Chalk tools during a session also increases the number of potential cases to be tested.


Getting through all of this testing manually is a horrendous job. Automation is the only viable option. It is important to realize that automation is not limited to the execution of the application being tested. There are several aspects of testing to consider, which are applicable to distributed applications in general:

Component configuration. The test lab contains a variety of equipment, including computers and the network equipment used for the test. The components need to be configured before the test begins. This includes network settings and operating system/software images used on the computers.

Build installation. Most modern software development projects have automated build systems. Once the build has completed, automated testing makes sense. Before this can happen, the latest successful build has to be installed on all computers in the test lab.

Test execution on a given component. I associate actions with components rather than with the application, because Silicon Chalk also interacts with other applications. As such, actions might refer to other applications and are not limited to Silicon Chalk. The instructor might present another application to the students, or the Classroom Management Tool on a student’s machine might report open applications to the instructor. For this reason, the actions to be performed on a machine might include opening and driving other applications before, and after, Silicon Chalk has been started. An agent on a machine is responsible for processing the script for that machine.

In addition, some actions may occur on such hardware as access points. For example, this would allow us to ensure that Silicon Chalk degrades gracefully when the wireless network deteriorates.

Synchronization of actions across a number of machines. For repeatability, the actions performed on different machines must be synchronized. For example, students cannot begin answering a quiz before they have received it.

Post-test log file analysis. Once the test has ended, data needs to be collected and analyzed to determine the success of the test. It is possible that performance issues exist, even though the session appeared to proceed normally.

Build installation and post-test analysis will typically be the same for each test, so the specification for a particular test would need to include a component configuration together with a set of actions and how those actions are synchronized. For the remainder of this article, I focus on specifying synchronized actions.

One natural approach to scripting synchronized actions is to write one script for each machine and encode synchronization points into each script. A synchronization point is like a gate in the script. Actions scripted after the synchronization point can be processed only after all agents have reached that point.

Having multiple scripts per test, however, creates a problem: it makes the test difficult to maintain. The primary problem is readability. Creating a mental picture of what is going on in the test lab by looking at multiple sources is quite cumbersome. It can be done with multiple scripts, but it’s an approach that becomes tedious and error prone. A better solution is to write one script in a way that encodes not only the actions performed on each machine, but also how those actions are synchronized.
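The gate behavior of a synchronization point can be sketched with ordinary threads standing in for per-machine agents. This is a toy model, not Silicon Chalk's actual agent code; the names `agent` and `sync_point` are mine:

```python
import threading

# A synchronization point acts as a gate: no agent proceeds past it
# until every agent has reached it. threading.Barrier models this.
NUM_AGENTS = 3
sync_point = threading.Barrier(NUM_AGENTS)

results = []

def agent(name, log):
    log.append(f"{name}: pre-sync action")
    sync_point.wait()            # block until all agents arrive at the gate
    log.append(f"{name}: post-sync action")

threads = [threading.Thread(target=agent, args=(f"m{i}", results))
           for i in range(NUM_AGENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every pre-sync entry must precede every post-sync entry.
pre = [i for i, e in enumerate(results) if "pre-sync" in e]
post = [i for i, e in enumerate(results) if "post-sync" in e]
assert max(pre) < min(post)
```

The barrier guarantees exactly the property the prose describes: actions scripted after the synchronization point run only once all agents have reached it.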


Let’s consider some desirable properties of a script that refers to multiple test-lab components. I refer to such a script as a score. (See Score Language Constructs sidebar.) The score needs to describe not only a sequence of steps, but also what happens in parallel on different test-lab components. A test-lab component can be a machine, an access point, a switch, etc. Anything that can be automatically manipulated that is also relevant to a test should be scriptable in the score. The score is executed on a server, the conductor, which controls the test lab.

To be maintainable, the score must first be readable and must refer to logical components rather than physical ones. This allows one score to be applied to a wider variety of configurations. A score might contain a preamble that makes the component mapping explicit, or the mapping might be computed algorithmically. Either way, the body of the score should contain logical references only.

A key point is that a logical reference might actually refer to a group of machines. We use this approach to simulate a number of students performing a task on different computers in parallel. Blocks of instructions can be assigned to a non-empty set of components that is given a name (e.g., groupA in the Score Syntax sidebar). This property makes a score versatile and fault tolerant. A score is written assuming that on any given run of the test, there may be different physical members of groupA. For Silicon Chalk, there are several student machines, but only one instructor. By writing the script with Instructor and Student groups, we have the versatility to have each machine in our test lab behave as the instructor, in turn.
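A minimal sketch of such a logical-to-physical mapping, assuming a hypothetical `mapping_for_run` helper that rotates the instructor role across the lab's machines on successive runs (machine names are invented):

```python
# Hypothetical preamble: map logical group names to physical machines.
# Rotating the mapping lets each machine take a turn as the instructor.
machines = ["lab-01", "lab-02", "lab-03", "lab-04"]

def mapping_for_run(run_index):
    """Return {logical_group: [physical machines]} for one test run."""
    instructor = machines[run_index % len(machines)]
    students = [m for m in machines if m != instructor]
    return {"instructor": [instructor], "students": students}

m0 = mapping_for_run(0)
m1 = mapping_for_run(1)
assert m0["instructor"] == ["lab-01"]
assert m1["instructor"] == ["lab-02"]
assert len(m0["students"]) == 3
```

Because the score body names only `instructor` and `students`, the same score runs unchanged no matter which physical machines play which role.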


Failures may occur on any of the test-lab components during a test. These failures need to be identified so that the cause can be determined. Manually monitoring all of the components of the test lab is infeasible, so errors need to be detected automatically and logged for later analysis. A failure on one component does not mean that the test has failed. Indeed, some tests may be designed to cause failure behavior on various components in order to test robustness.

It is important to distinguish between a recoverable failure on a component, the failure of a component, and the failure of a test. As long as a minimal set of functioning components exists in the test lab, there is value in continuing with the test. Since we wish to uncover as many failures as possible with our tests, it is desirable to continue until the minimal set no longer exists.

Some consideration must be given to identifying when a test has failed, so that the test lab can be reset and the next test deployed. Since the script is written with the assumption that there is at least one member of each group, it is reasonable to assume that continuing the test has value until an action is encountered such that one of the groups specified no longer has members.

For example, if the instructor fails, a Silicon Chalk test is essentially dead and there is no point in continuing. If one student fails, however, we would like to continue the test in order to trap any other failures the test may uncover on other machines. Post-test log-file analysis will alert us to the earlier student failure. Even if multiple student failures occur, the test can continue, so long as there are at least one student and one instructor. Writing the score in this way lets us specify the required minimal set of test-lab components implicitly, and this set can change as the test progresses.

What does component failure mean? A failed component can no longer be trusted to perform as expected. Once this has happened, it is important to drop the component from the test because of its potential to introduce red herrings into the test results.

The conductor must identify a failure somehow. This can be through some component-internal means, such as an exception, or it might be through some component-external means, such as a liveness detector. Some failures may mean that the current operation has not succeeded, but the component is still able to function. Others are more serious and represent component failures. Both of these types can be accommodated by exceptions.

For example, consider the score fragment:

   try {
          [groupA] try {
                 A
                 B
          }
          catch (local_exception) {
                 C
          }
   }
   catch (component_exception) {
          D
   }

The score defines code that might run on different components. In this example, the inner try-block is executed on groupA components. Note that D is outside the scope of the groupA designation, so D would execute on the conductor if an exception were thrown by C. A local failure occurring during A or B on a groupA component would cause the component to continue at C, but the conductor would not execute D because the component itself did not fail. A more serious exception would be trapped by the outermost try-block, the conductor would execute D, and the test would continue. If there were no outermost try-block, the score would fail, and the next test would be deployed.


Ideally, we would like all the components in the test lab to execute their actions at precisely the right time to ensure that the test is truly repeatable. When debugging, we want to be able to observe the problematic behavior repeatedly to inspect different aspects of the fault. Unfortunately, as we increase the number of components in the test lab, we lose repeatability when we attempt to reproduce a bug. Even if we were able to re-create the correct stimuli across the test-lab components, re-creating the precise synchronized state is incredibly difficult.

There is a small delay between the conductor signaling the start of an action and the agent starting the action on the component. This means that, although we want as short a component start-delay as possible, some nonzero delay is expected. Factors that contribute to this delay include: different operating systems on the components, different additional software, different hard-disk fragmentation, different processor speeds, and different memory.

Since controlling these variables is impractical, I have found that it is actually desirable to assume these variations for each test run. This gives us confidence that, over time, we are covering a more realistic number of conditions with our tests. Since it is unlikely that we would be able to reproduce a given test run exactly, this means we need to rely on instrumenting the application so that we have as much information as possible at our disposal when finding the source of a failure. Logs should contain enough information so that a software developer can understand the context in which a failure has occurred. Some examples of information include code path tags and performance values.
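As an illustration only, instrumentation records carrying code-path tags and performance values might look like this; the field and tag names are invented for the sketch:

```python
import json
import time

# Sketch of instrumentation: each log record carries a code-path tag and
# performance values, so a post-test analyzer can reconstruct the context
# in which a failure occurred.
def log_event(log, tag, **perf):
    log.append(json.dumps({"t": time.time(), "tag": tag, **perf}))

log = []
log_event(log, "slide.send", slide=7, bytes_sent=48213, net_util=0.62)
log_event(log, "quiz.recv", responses=31, latency_ms=180)

records = [json.loads(line) for line in log]
assert records[0]["tag"] == "slide.send"
assert records[1]["responses"] == 31
```

Structured records like these make the post-test analysis step mechanical rather than a matter of grepping free-form text.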


The score syntax says nothing about when execution starts on a given component, nor how long execution on a component will take. All that is guaranteed are the synchronization points. This raises the deadline issue. How long do we wait for a component to arrive at a synchronization point? Somehow, we need to determine and impose a deadline we can use to deem components failed.

If a component misses the deadline at a synchronization point, it is dropped from the test. Indeed, since the component is now out of sync with the score and running outside its assumptions, the corresponding agent must deactivate the component so that it cannot contaminate further test results.

At Silicon Chalk, our scripting language is based on user interface events that occur at prescribed intervals. This makes it simple to compute a deadline for a response from a component. The deadline is simply the time at which the next event would occur if there were one.

Since we assume that the application can process user interface events within the time interval specified in the script, any additional waiting that needs to happen is specified explicitly. For other scripting environments, other approaches might be more palatable. You may wish to specify the duration explicitly. The inevitable upgrade of test-lab hardware, however, will likely make this a labor-intensive approach.

A statistical approach might be more effective. What is needed is a simple means of specifying the degree of tolerance in the distribution of completion times for an action. For example, suppose that the conductor is configured to tolerate completions up to 10 percent longer than the average action duration, based on completions received. If the conductor knows that it received action starts from 10 component agents in a group and that it has recently received eight completions, then it can continually recalculate its own deadline and deem certain components failed after 1.1 times average duration has passed. Note that the conductor must wait arbitrarily long for the first component to respond.

To avoid waiting indefinitely for a single failed component, a suitably large default deadline might be used. Alternatively, a certain amount of time might be allocated for the test. If the time limit is exceeded, the test is terminated. Our Silicon Chalk tests use time limits plus the user interface event schedule approach, though I believe that the average response factor approach is more desirable. After all, one could envision an enhanced syntax that would allow the optional specification of the average wait factor and hard deadline.

As an example, consider [groupA] 1.5 10 S, where S is some action. This would mean that members of groupA have a maximum of 10 seconds to send an action-completion response to the conductor, and must also beat the deadline that is 1.5 times the current average of received responses. Let's suppose there are four members of groupA: A1, A2, A3, and A4. If A1 responds at the 4-second mark, then A2, A3, and A4 will be deemed failed unless at least one of them responds before 6 seconds have passed. Suppose that A2 responds at 5 seconds. Now the deadline for A3 and A4 has moved to 6.75 seconds. If A3 responds at 6 seconds, A4 has until 7.5 seconds.

Deadline tolerance can easily be made more rigid. Consider the previous example with a deadline average factor of 1.05. If A1 responds at 4 seconds, at least one more response must be received before 4.2 seconds.
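The deadline arithmetic above can be checked with a small sketch; the function name is mine, not score syntax, and the first responder is assumed to be bound only by the hard limit:

```python
# Sketch of the "[groupA] 1.5 10" rule: a member must respond within the
# hard limit (10 s) AND, once responses exist, before factor (1.5) times
# the current average of the responses received so far.
def current_deadline(responses, factor, hard_limit):
    if not responses:                 # first response: hard limit only
        return hard_limit
    avg = sum(responses) / len(responses)
    return min(hard_limit, factor * avg)

factor, hard = 1.5, 10.0
assert current_deadline([], factor, hard) == 10.0
assert current_deadline([4.0], factor, hard) == 6.0            # A1 at 4 s
assert current_deadline([4.0, 5.0], factor, hard) == 6.75      # A2 at 5 s
assert current_deadline([4.0, 5.0, 6.0], factor, hard) == 7.5  # A3 at 6 s
```

Note how the `min` with the hard limit keeps a few slow early responders from stretching the deadline indefinitely.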


A well-designed test lab can economically address most of the testing requirements for the software it was designed to test, but there are always limits. I have chosen not to address a number of scenarios with the Silicon Chalk Test Lab for practical and economic reasons. Some examples include: very large numbers of laptops conducting several activities over a large number of access points; and exhaustive interactions with other software and wireless network hardware configurations.

For congregate applications such as Silicon Chalk, traditional test scripting is challenging because the description of a test is not centralized. A centralized script, which I call a score, encoding both the actions and their synchronization, addresses this issue. A score also adds to the versatility and fault tolerance of the scripted tests.

Questions that arise when implementing a score-based test lab include: How should logical components be mapped to physical ones? How are component failures detected, and when is a test no longer worth continuing? How are deadlines at synchronization points determined?

In this article, I have suggested some answers to these questions based on my Silicon Chalk perspective. I would expect that different products would favor different sets of answers.

Score Language Constructs

The example here illustrates the language constructs discussed in this article.

The function test1 (below) specifies the sequential and parallel actions that are to be performed during one test. There would be several such functions, each one representing a separate test.

Above the test specifications are instructions on test-lab setup prior to testing (install-build), a section that executes the tests (test1, test2), and finally a section that collects results and creates summaries.

try <response-average-deadline-factor=5 timelimit=20*60*1000> {
      [laptops] {
            install-build();
      }
      tests = new Array(test1, test2); // test2 defined elsewhere
      try <response-average-deadline-factor=1.5> {
            for test in tests {
                  try <response-average-deadline-factor=3 timelimit=4*60*1000> {
                        test();
                  }
                  [laptops] reboot();
                  sleep(3 * 60 * 1000);
            }
      }
      [laptops] report-log-summary("\1");
}

function test1() {
      [instructor, students] Login();
      par {
            [instructor] {
                  Sleep(10 * 1000);
                  Browser.EnterField("search", "quirks");
                  Sleep(5 * 1000);
            }
            [students] {
                  // ...
            }
      }
      Sleep(5 * 1000);
      [instructor] PowerPointShow("test.ppt");
      // contains 25 slides, slide transitions every 12 seconds
      [students] SendQuestion("This is a test question " + computername);
      [instructor, students] {
            // ...
      }
}

Score Syntax

To specify a sequence of actions, we can use the normal approach found in
most programming languages. Sequence:

A
B
C

This, of course, means that A is followed by B and then C.
To specify that actions happen in parallel, we might use a construct such as:

par {
      A
      B
      C
}

This means actions A, B, and C all start at the same time.

To specify which component performs the actions, we can simply tag the actions with a list of the logical components that will execute them.

[groupA] A
[groupB] B
[groupA, groupB] C
par {
      [groupA] A
      [groupB] B
}

There are implicit synchronization points between each of the sequential actions (A-B, B-C, C-par). The scripting or programming language used in placeholders A, B, etc. is irrelevant. Indeed, it may be desirable to allow multiple languages, depending on the target components, which may have different
agent script engines.

MICHAEL DONAT is the director of quality assurance at Silicon Chalk Inc., Vancouver, BC, Canada. His interest in software development issues began while working as a software design engineer for Microsoft from 1987 to 1992. He earned his Ph.D. from the University of British Columbia, where his thesis focused on the automated generation of tests from a formalized set of requirements.


Originally published in Queue vol. 3, no. 1



Sanjay Sha - The Reliability of Enterprise Applications
Understanding enterprise reliability

Robert Guo - MongoDB's JavaScript Fuzzer
The fuzzer is for those edge cases that your testing didn't catch.

Robert V. Binder, Bruno Legeard, Anne Kramer - Model-based Testing: Where Does It Stand?
MBT has positive effects on efficiency and effectiveness, even if it only partially fulfills high expectations.

Terry Coatta, Michael Donat, Jafar Husain - Automated QA Testing at EA: Driven by Events
A discussion with Michael Donat, Jafar Husain, and Terry Coatta

© 2020 ACM, Inc. All Rights Reserved.