And the function computes the energy used the program and the error

Automatically Exploring Tradeoffs Between Software Output Fidelity and Energy Costs

1 INTRODUCTION

T these scales, non-functional [1] properties such as energy to support a growing spectrum of applications. At HE use of data centers has expanded in recent years
	Dorn, Lacomis, and Weimer were with the Department of Computer Science at the University of Virginia, Charlottesville, VA, 22904 when this work was done. Email: {dorn,lacomis,weimer}@virginia.edu Forrest is with Arizona State University, Tempe, AZ and the Santa Fe Institute, Santa Fe, NM 87287.

One of the difficulties in managing energy usage at the software level is lack of visibility into how implementation decisions relate to energy use [9]. Indeed, the success of approaches at so many different levels (hardware archi-tecture, operating systems, compilers, and API selection) is an artifact of the large number of variables that influ-ence energy consumption. We address this difficulty using stochastic search to modify compiled assembly programs and measure the energy required to execute an indicative workload, allowing us to identify beneficial modifications. We observe that common hardware-based techniques for measuring energy consumption have significant limitations; we therefore propose and evaluate techniques to mitigate this issue. Our method, called POWERGAUGE, provides energy reductions that retain required functionality, and we investigate more aggressive reductions for applications that can tolerate slight reductions in output quality [10]. We achieve this via a multi-dimensional search algorithm.

We evaluate POWERGAUGE on the PARSEC [11] bench-

The main contributions of this article are as follows:

•	An empirical evaluation of POWERGAUGE using large- and small-scale applications representative of data center applications. We find that our technique reduces energy consumption by 14% over and be-yond gcc -O3 for fixed output quality and 41% when relaxing the output quality. An exploration of techniques for managing search space explosion due to large program sizes.

2 BACKGROUND AND RELATED WORK

In this section we discuss three broad approaches to power improvement: GA-based techniques, semantics-preserving techniques, and approximate computing.

1.
2.

2.2.1 Superoptimization

Superoptimization techniques [22], [23] check large numbers of sequences of assembly instructions to find an optimal execution sequence. These techniques are similar to ours in that both may change the implementation. However, superoptimization techniques scale only to short sequences of assembly instructions, while our approach operates on entire programs. Superoptimization and our approach are both assembly-to-assembly transformations, but they are independent and could be composed together in any order.

2.3 Approximate Computing

Approximate computing has emerged as an alternative ap-proach to decreasing runtime and energy consumption [28], [29]. By trading off some computational accuracy, approx-imate computing allows for reduced runtime or energy consumption, similar to how lossy compression trades off quality for space efficiency.

2.3.3 Precision Scaling

Precision scaling improves efficiency by altering arithmetic precision [34], [35]. Adjusting variable precision can expose optimizations or make more efficient use of hardware. For example, rounding floating point values near one can com-pletely optimize out an expensive floating point multiplica-tion. Additionally, scaling the precision of data can change memory layout and lead to better cache performance as alignments change.

2%

% error	1%

0%	50%	100%

3 MOTIVATION

Energy consumption is a significant cost for data center scale computing. With power for American data centers projected to reach an annual cost of tens of billions of dollars [3] in the coming years, companies have already begun taking steps to reduce energy expense. A recent example is Google’s $2.5 billion investment in wind and solar farms near their data centers.3
In addition to hardware and compiler techniques for reducing energy consumption, there is a need for software modifications to further reduce these costs. In this article, we present POWERGAUGE, a mostly-automated technique and prototype implementation for exploiting opportunities for relaxing output quality to reduce energy consumption. POWERGAUGE takes as input compiled assembly code and an existing test suite, and produces an optimized program with the same behavior but reduced energy requirements. In this scenario, POWERGAUGE imposes little additional bur-den on developers, since they may reuse existing tests and need only adapt their build process to produce assembly files and provide a mechanism to measure energy consump-tion. (We discuss one possible mechanism in Section 4.) To achieve even greater energy reductions, the test suite may be augmented with a metric that estimates the quality of the output (instead of merely pass or fail), enabling POWERGAUGE to search for programs that optimize for energy consumption while allowing for small differences in output. We give some examples of simple, yet effective metrics in Section 6.1. With such an augmented test suite, POWERGAUGE produces a list of Pareto-optimal programs that trade off energy consumption and error. In this case, the developer or end user may select the program that provides the most desirable balance for their particular use case.

4 POWER MEASUREMENT

To optimize the energy usage of a program, our search re-quires a fitness function to estimate the energy consumption of each individual. To achieve this, we require a mecha-nism capable of measuring whole-system energy of indi-vidual servers without requiring hardware modifications. This apparatus must also have suitably fine-grained time and energy resolution and a reporting rate that does not greatly increase our search times. Additionally, to minimize noise due to overhead on the system under test, we require that the device be entirely self-contained without relying on monitoring software running on the same system. As a practical matter, we also require it to be sufficiently cost effective to run several experiments in parallel, allowing one to take full advantage of the independence of fitness evaluations in a GA.

One common alternative to energy models is direct measurement. For example, many commercial devices, such as the Watts up? PRO meter, can inexpensively measure and report energy consumption. These off-the-shelf meters are simple to install and easy to use: the system under test is plugged into the device and energy consumption is reported over USB. However, these devices are typically designed for long-term monitoring and are not for capturing rapid changes such as those caused by relatively short program executions. POWERGAUGE typically compiles and eval-uates tens of thousands of candidate programs during a search, and the limited (1 Hz) reporting rate of the Watts up? PRO creates delays that greatly increase the time required for a search (see Section 4.2).4Note that sampling rate is distinct from reporting rate; sampling rate refers to the rate at which a signal is measured, while reporting rate refers to the rate at which the samples are sent to the system monitoring the measurement device.

Although some specialized solutions for measuring en-ergy exist, we were unable to find one suitable for our application. LEAP [40] and similar projects [41] use a spe-cialized Android platform to measure energy consumed by mobile devices, and could not be directly adapted to monitor server systems. JetsonLeap [42] is designed to accu-rately measure power consumption to enable power-based compiler optimizations, however it requires that the system under test have a general-purpose I/O (GPIO) port. GPIO ports are common in System-on-a-Chip devices such as the Arduino or BeagleBone, but are rare on server systems. Similarly, the Monsoon Power Monitor5is only designed to measure power on mobile devices rated to 4.5 V and costs $771 to measure a single device. Other approaches require a separate measurement PC and either only monitor CPU power [43], or measure whole-system energy but require specialized boards to be installed on power lines inside the device under test [44].

Fig. 2: Energy meter setup. A: emonTx V3 energy monitor-ing node. B: Raspberry Pi. The Raspberry Pi is connected to the emonTx by UART (multicolored cable), and reports via ethernet (bottom cable). C: Monitored receptacle. D: Accu-CT ACT-0750 5A current transformer. Note that the current transformer is around the hot (black) wire attached to the receptacle. Up to four current transformers can be moni-tored by the emonTx at once. E: AC-AC voltage adapter. The emonTx uses this voltage adapter both for power and for monitoring the voltage present at the receptacle.

4.2 Measurement Apparatus

Although this baseline hardware provides a cost-effective solution for high resolution time and energy mea-surements, we found that the default firmware needed to be completely rewritten to meet our requirements. Our soft-ware running on the microcontroller combines the signals from the current transformers with the voltage reading from the AC-AC voltage adapter to compute the real power on each line. This power is reported via the on board UART serial device. Our present prototype implementation is ca-pable of reading inputs from the four current transformers and the voltage adapter at a sampling rate of about 1200 Hz, which is significantly faster than can be transmitted via the serial controller. We therefore aggregate a configurable number of measurements together and report the average power usage less frequently. For all experiments in this article, the microcontroller sampled at 1200 Hz and reported measurements on the serial bus at 10 Hz. This is ten times faster than the reporting rate that was possible with the Watts up? PRO energy meters, and supports measuring energy consumption at a rate that makes large-scale searches feasible.

The code to convert the integral sensor readings into floating point current and voltage readings requires coef-ficients to scale the values properly. We calibrated these using a Watts up? PRO device as a baseline. Although the Watts up? PRO is not suitable for fitness evaluations, as a commercially calibrated meter, it is suitable for use as a baseline for calibration, which can tolerate slower responses. Note that properly calibrating real power measurements requires a resistive load, such as a high-wattage light bulb or small heating element, so that real power and apparent power are equal [45, § 8.4]. We used a lamp with three 40-watt incandescent light bulbs to produce a large enough load for the limited power resolution of the Watts up? PRO to provide four significant digits. After calibration we collected 2500 readings of the resistive load to confirm that the power readings from the microcontroller were reliable. We confirmed that they were approximately normally dis-tributed (Shapiro-Wilk normality test, p > 0.1: large p-values fail to reject the null hypothesis that the distribution is normal [46]) and showed a small standard deviation relative to the average value (about 0.7%).

As of 2017, the hardware required to monitor the power consumption of a single machine costs $244. However, a single emonTx v3 node can simultaneously measure four different current transformers. Thus, the additional cost of measuring up to three more machines is only $47 per current transformer. The final hardware cost to monitor four machines is $385, just under $100 per machine. The custom firmware for this hardware outputs data that is directly compatible with POWERGAUGE and no additional support hardware or software is required.

This system provides fast, reliable measurement of a constant load, showing less than 1% deviation in the mea-surement of reference light bulbs. However, the energy usage of a computer system is much more complicated. In the next section, we discuss measures to stabilize the system load.

10.

Input: p : Program
Input: FITNESS : Program →Rn Input: MaxCount : N
Input: PopSize : N
Output: Front : Pareto frontier of Programs

1: function POWERGAUGE(p, FITNESS, MaxCount, PopSize)

P ←{} ADDTOPOP(P, p, FITNESS)

▷Adds p to P and evaluates its fitness

count ←count + 2
Ranks ←NONDOMINATEDSORT(P ∪Q) P ←Ranks[1 : PopSize]

Front ←{}
for all q ∈GETFRONTIER(P) do
Front ←Front ∪MINIMIZE(q, FITNESS)

The measures described above mitigate most of the noise we observe. However, they do not eliminate noise completely. Some remaining potential sources of noise include environ-mental factors such as ambient temperature, the physical limitations of the measuring device, periodic system main-tenance tasks scheduled by the Linux kernel, scheduling delays between opening the TCP/IP connection and starting the subprocess, and communication delays on the TCP/IP or serial communication channels. This remaining noise level is low enough that the GA used by POWERGAUGE is able to effectively search for energy reductions (several works have shown that GAs can be effective even with noisy fitness functions [15], [16], [17]), but we still would like to gain confidence that the modified programs have a statistically significant energy savings. We discuss the sampling step used by POWERGAUGE in Section 5.3.

5 POWERGAUGE OPTIMIZATION ALGORITHM

The following subsections detail our program represen-tation and mutation operators (Section 5.1) and the multi-objective GA (Section 5.2). In Section 5.3, we describe our algorithm for minimizing differences between the original and optimized programs while retaining the optimizations. Finally, we elaborate the techniques we developed to better manage the search space induced by large programs in Section 5.4 and Section 5.5.

5.1 Program Representation

• •	Swap exchanges the positions of two lines in the program.

The main body of the GA consists of Lines 9 to 20. We optimize both energy and error simultaneously to identify those modifications to the original program that provide the best tradeoffs between them. That is, we compute a Pareto frontier, the set of modified programs such that every other program in the search has worse energy or error (or both). To this end, our multi-objective algorithm uses the non-dominated sort from NSGA-II [50]. This sorting defines a pair of values for each member of the population based on their fitness values. In the tournament selection on Line 13, these pairs are compared lexicographically to identify the tournament winner. At a high level, the first element of the pair indicates how close the individual is to the Pareto frontier while the second gives greater weight to individuals in less well-represented parts of the frontier.

After selecting two individuals using the above tourna-ment selection procedure (two tournament rounds produc-ing two winners), the crossover operator is applied with 50% probability. Next, the mutation operator is applied to each child, and finally the fitness of the mutated children is evaluated. Once PopSize new individuals have been gener-ated, we then select the elements of the population the next generation. Our algorithm is elitist, considering individuals from both the previous and current generation together. Out of these PopSize individuals, the best PopSize new or old individuals are selected, according to the NSGA-II ranking, which forms the next generation.

Our minimization algorithm uses Delta Debugging [51], which takes as input a set of edits and identifies a 1-minimal subset of those edits that maintains the optimized performance as measured by the fitness function. The Delta Debugging algorithm is linear time in the original number of edits and requires evaluating the fitness of a new collec-tion of edits at each step.

Because our energy measurements are stochastic, we col-lect several samples and apply a one-tailed Wilcoxon Rank-Sum Test [52] to determine whether the distribution of fit-ness values is worse than the distribution of values collected for the optimized variant. If the test for either objective indicates a significant difference between the distributions (p < 0.05), we treat that variant as “unoptimized.” In all experiments described in this article, we collected at least 25 fitness samples for each Delta Debugging query, increasing this number as necessary to increase the power of the statis-tical test, always maintaining relative standard error below 0.01. We found that starting with 25 energy measurements and using the relative standard error threshold provided a good tradeoff between runtime and smaller minimized genomes. At the end of the Delta Debugging stage, we are left with a 1-minimal set of edits to the original program that results in a statistically significant reduction in energy consumption.

1 2 3 4	jle .L91 .p2align 4,,10 .p2align 3
1 2 3 4	movl

5.5	•

		We disallow copying or deleting labels. Either copy-
		ing a label into a file where it already exists or delet-
	•



		We allow Swap to operate on labels only if the ex-
		change is limited to a single file. Swapping a label




		Profiling

Our search space reductions remove from consideration programs that our simple static analyses show cannot im-prove fitness. However, they still allow edits to portions of a program not visited by the target workload. For example, inserting an instruction into a program region that is not executed is unlikely to affect energy usage. We therefore investigate the use of execution profiles to capture the run-time behavior of the program and increase the probability of making edits that impact fitness. The insight behind our use of execution profiles is that beneficial edits in more frequently-executed regions of the code are likely to have a greater impact on the fitness than beneficial edits in less-frequently executed regions.

			Unique		Executed
			Lines	%	Lines	%	Tests (s)	Error Metric
		12,437	3,504	28	637	5	2.7
blender			1,574,349	9	256,687	1	17.6
blender	(planet)	-	-	-		1	10.6
		198,462	62,544	32	23,746	12	3.3	RMSE
ferret		80,811	26,883	33	15,181	19	6.4
		7,511	4,436	59	3,828	51	2.7	Hamming distance
		26,281	12,115	46	10,404	40	7.4
libav	(mpeg4)	22,831,124	698,445	3	42,747	0	1.3
	(prores)	-	-	-	34,634	0	2.7
		55,753	14,911	27	2,911	5	3.2	RMSE
vips		822,655	160,075	19	24,000	3	18.1
		205,801	58,754	29	41,063	20	5.7
		43,128,694	2,836,551

We investigate the following research questions:

RQ1 RQ2

6.1 Benchmarks

We chose blender and libav to investigate the scalability of POWERGAUGE. blender is a large 3D computer graph-ics application supporting a wide variety of tasks such as

program output and the reference generated by the original program. The ferret benchmark computes a number of image similarity queries; the output consists of one list of similar images, ranked by similarity, for each query. Our error metric for ferret computes Kendall’s τ, which quan-tifies the similarity of order between two sequences. Finally, fluidanimate writes out a C struct; since this admits less human intuition about the meaning of “acceptable” levels of error, we simply compute the Hamming distance between the two files.

While in the particular domain of computer graphics there exist error metrics that can account for such mo-tion (e.g., the Earth Mover’s Distance [58] or a Structural Similarity Index Metric (SSIM) [59]), there are two main problems with a general approach to more precise error met-rics. First, many of our benchmarks require domain-specific knowledge (e.g., domain-specific models of readability have required hundreds of annotators [60]), to judge acceptability and domain experts would be needed to train or create models. Second, since every individual created during a search must be measured with the error metric, we would like to minimize the time cost of measuring error by using efficient error models.

A strength of POWERGAUGE is that it can find op-timized programs with human-acceptable levels of error despite imperfect error metrics, because GAs can tolerate a noisy fitness function [15], [16], [17]. A programmer can create a simple error metric for use with POWERGAUGE, then a domain expert can be consulted after the search to subjectively judge the outputs of 10-20 optimized programs created during the search, selecting the most efficient pro-gram with acceptable output.

6.3 Parameters and Hardware Specifications

We use a value of 512 for PopSize and apply exactly one mutation operator to each child after the crossover stage. The number of individuals created varied per benchmark— we targeted a runtime of approximately two weeks for each. With this target in mind, we chose 16,384 individuals for libav (mpeg4), 32,768 individuals for all blender benchmarks and libav (prores), and 65,536 individuals for the remaining benchmarks. Note that the total runtime of POWERGAUGE is not only a function of the runtime values in Table 1, but also of factors including the time to build and to estimate output error. It is also possible that a generated program can enter an infinite loop upon execution. Because of this, we set a maximum runtime of 60 seconds for each individual during fitness evaluation. Including test case timeouts to guard against infinite loops is a standard practice when using GAs to generate and validate new programs [61].

% Energy Red.				% Energy Red.
Baseline				Best
		0% Error	Accept.	0% Error	Accept.
		91	91	92 1	92 10 0
		0	0	92 1
blender (planet)		0	0	0
		0	0	0
		0	30 0	0
fluidanimate*		0	30 0	0
		0	0	8
		0	0	0
		3	3	3
swaptions		39	68 29 65	39
		21		21
		0		0
		13	24	14	41
	High-level summary of the energy reductions
found by our technique. The “Baseline” columns show the results of the experiments when run without search space reductions or profiling, while the “Best” columns contain the best results of all experiments, including those run with search space reductions, profiling, or both. The results are subdivided into “0% Error”, where the output of the optimized program is identical to the output of the original program, and “Accept.”, where a human-acceptable level of error is allowed in the output. The fluidanimate benchmark has no acceptable error level because its output is a serialized binary and this error metric is not suitable for subjective assessment. We find optimizations in 10 out of 12 benchmarks and an average energy reduction of 41 when allowing for human-acceptable error and using search-space reductions and profiling.

R430 server with a 3 GHz Intel E5-2623 processor and 16 GB

When allowing for an acceptable amount of error, POW-ERGAUGE finds additional energy reductions. For the swaptions and vips benchmarks, energy consumption was reduced by an additional 29% and 8% respectively. Energy reductions were also found for the ferret and x264 benchmarks. For blackscholes, bodytrack, and fluidanimate POWERGAUGE found energy reductions when allowing for error, but we judged none of the outputs to be human-acceptable. On the remaining benchmarks, POWERGAUGE was unable to find any energy reduction without incorporating the algorithmic improvements dis-cussed in Section 5.4 and Section 5.5. We discuss the results of augmenting POWERGAUGE with these features in Sec-tion 7.2.

An example of how allowing for a small amount of error can lead to additional energy reduction can be seen in Figure 6. We observe that the images are subjectively very similar, especially in a use case where the output is consumed by human eyes. The vips image has a slight ver-tical deformation or stretching. As a result, the pixels do not line up exactly and there is slight error in each individual pixel (as shown in the inset heat map). In this article we intentionally consider simple, indicative error metrics that represent a worst-case scenario for the search. Allowing error in the vips benchmark allowed POWERGAUGE to find an additional 5.3% energy reduction at the error rate corresponding to the image in Figure 6 when compared to the maximum energy reduction with no allowed error.

		7.2

In this section, we address each of the research questions posed in Section 6.

7.1 Human Acceptability

Two of the larger benchmarks, blender and libav, are representative of software that is normally deployed in server farms. Since these benchmarks have much larger codebases than the PARSEC suite, we were prompted to modify the POWERGAUGE algorithm to target scalability. The codebase for libav is particularly large, at 22.8 MLOC, and thus POWERGAUGE has an extremely large search space. We are able to greatly reduce the number of al-lowed edit locations by applying search space reductions and profiling as discussed in Section 5.4 and Section 5.5. For example, without these reductions POWERGAUGE can apply the Delete operator to any of the 22.8 MLOC in libav, but after profiling this is reduced to only 34,634 lines in the case of the prores input, a search space reduction of 99.8%. The results of searches with profiling and reductions are

	POWERGAUGE (%)	Loop Perforation (%)
blackscholes	96	92
	82	62
	86	70
	82	52
x264	85	50
average	86	65

7.3 Comparison to Other Approximate Techniques

In this section we address RQ3, and compare POWER-GAUGE to other, more specialized approximate computing techniques.

POWERGAUGE often finds optimizations in the form of loop perforations. In blackscholes, POWERGAUGE finds exactly the same power savings at zero error as loop perforation. This is because the blackscholes benchmark deliberately performs redundant calculations in a loop. Both techniques short-circuit this loop, but note that POWER-GAUGE is able to find even more power savings than loop perforation when some error is allowed. In bodytrack, swaptions, and x264, POWERGAUGE finds more ef-fective optimizations at low error rates. On the ferret benchmark, loop perforation slightly outperforms POW-ERGAUGE at low error rates, but POWERGAUGE finds greater overall power savings.

In addition to more common loop perforation tech-niques, such as skipping every nth iteration of a loop or ter-minating a loop early, POWERGAUGE’s general mutations can lead to other types of loop modifications. For example, when a compiler performs loop unrolling, several copies of the loop body are created to speed up execution time. Each of these unrolled loop bodies can be independently modified, allowing POWERGAUGE to insert code that only executes in a specific loop iteration .

8 DISCUSSION

In this section we first discuss the benchmarks where we were unable to report optimizations from POWERGAUGE, even after allowing for error and applying search space reductions and profiling. Next, we discuss the nature of the optimizations found by POWERGAUGE. We then analyze the similarities and differences between optimizing for en-ergy consumption and optimizing for runtime. Finally, we address threats to validity.

benchmark programs we leave the solution of this problem for future work.

8.2 Nature of Optimizations

These problems do not necessarily completely preclude the use of runtime as a fast, inexpensive energy model. It could be possible to use both an energy meter and runtime in a hybrid approach, where the search is performed with

8.4.1 Relaxed Semantics and Error

Since our technique often performs transformations that do not preserve semantics, it is possible that an optimization can change program behavior in an undesirable way. We mitigate this problem by incorporating an error metric into our search and direct POWERGAUGE to minimize the in-duced error. If desired, a developer can specify particular program properties to preserve by incorporating them into the error metric (e.g., assigning an infinite error to any violation of a key invariant or assertion).

disks consume a large amount of energy, the state of on-disk cache is likely to change after an access during a test, which could affect the energy consumption of a subsequent access. Addressing this problem could require modifying the testing framework or the POWERGAUGE algorithm.

8.4.4 Architecture

ACKNOWLEDGMENTS

We would like to thank Kevin Angstadt for his help set-ting up the microcontrollers, debugging their software, and for many fruitful discussions. We are also grateful to Eric Schulte for his contributions to the early stages of this work and Shane Clark at Raytheon BBN Technologies for his helpful suggestions on measuring real-world energy.

	[1]
unlikely to be an appropriate technique. For example, we	[1]
unlikely to be an appropriate technique. For example, we	[2]
make the assumption that the state of the system as a	[2]
whole is identical at the start of every test run, but this	[3]
whole is identical at the start of every test run, but this
is not always practical. In a database system where hard		Issue Paper, Aug, 2014.

[4]

[7]

[8]

S. Reda and A. N. Nowroz, “Power modeling and characteriza-tion of computing devices: A survey,” Foundations and Trends in Electronic Design Automation, vol. 6, no. 2, pp. 121–216, 2012.

I. Manotas, L. Pollock, and J. Clause, “SEEDS: A software en-gineer’s energy-optimization decision support framework,” in International Conference on Software Engineering, ser. ICSE ’14, 2014, pp. 503–514.

[14] W. Weimer, Z. P. Fry, and S. Forrest, “Leveraging program equiv-alence for adaptive program repair: Models and first results,” in International Conference on Automated Software Engineering, ser. ASE’13, 2013, pp. 356–366.

[15] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.

[20] B. R. Bruce, J. Petke, and M. Harman, “Reducing energy con-sumption using genetic improvement,” in Genetic and Evolutionary Computation Conference, ser. GECCO ’15, 2015, pp. 1327–1334.

[21] M. Linares-V´asquez, G. Bavota, C. E. B. C´ardenas, R. Oliveto, M. Di Penta, and D. Poshyvanyk, “Optimizing energy consump-tion of GUIs in Android apps: A multi-objective approach,” in Joint Meeting of the European Software Engineering Conference and the Foundations of Software Engineering, ser. ESEC/FSE ’15, 2015, pp. 143–154.

[27] H. Jacobson, P. Bose, Z. Hu, A. Buyuktosunoglu, V. Zyuban, R. Eickemeyer, L. Eisen, J. Griswell, D. Logan, B. Sinharoy, and J. Tendler, “Stretching the limits of clock-gating efficiency in server-class processors,” in Symposium on High-Performance Com-puter Architecture, ser. HPCA ’05, 2005, pp. 238–242.

[28] J. Han and M. Orshansky, “Approximate computing: An emerging paradigm for energy-efficient design,” in European Test Symposium, ser. ETS ’13, 2013, pp. 1–6.

[33] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Ri-nard, “Using code perforation to improve performance, reduce energy consumption, and respond to failures,” MIT, Technical Report, 2009.

[34] Y. Tian, Q. Zhang, T. Wang, F. Yuan, and Q. Xu, “ApproxMA: Approximate memory access for dynamic precision scaling,” in Great Lakes Symposium on VLSI, ser. GLSVLSI ’15, 2015, pp. 337–342.

[39] C. Lee, J. K. Lee, T. Hwang, and S. Tsai, “Compiler optimization on instruction scheduling for low power,” in International Symposium on System Synthesis, ser. ISSS ’00, 2000, pp. 55–60.

[40] D. McIntire, T. Stathopoulos, S. Reddy, T. Schmidt, and W. J. Kaiser,“Energy-efficient sensing with the low power, energy aware pro-cessing (LEAP) architecture,” ACM Transactions on Embedded Com-puting Systems, vol. 11, no. 2, pp. 27:1–27:36, Jul. 2012.

[45] F. Ulaby and M. Maharbiz, Circuits, 2nd ed. & Science Press, 2013.	National Technology

[49] C. Le Goues, S. Forrest, and W. Weimer, “Representations and operators for improving evolutionary software repair,” in Genetic and Evoluationary Computation Conference, ser. GECCO ’12, 2012, pp. 959–966.

[50] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, Apr 2002.

[54] E. Schulte, J. DiLorenzo, S. Forrest, and W. Weimer, “Automated repair of binary and assembly programs for cooperating embed-ded devices,” in International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’13, 2013, pp. 317–328.

[55] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building cus-tomized program analysis tools with dynamic instrumentation,”in Conference on Programming Language Design and Implementation, ser. PLDI ’05, 2005, pp. 190–200.

[59] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[60] E. Daka, J. Campos, G. Fraser, J. Dorn, and W. Weimer, “Modeling readability to improve unit tests,” in Joint Meeting of the European Software Engineering Conference and the Symposium on the Founda-tions of Software Engineering, ser. ESEC/FSE ’15, 2015, pp. 107–118. [61] C. Le Goues, N. Holtschulte, E. Smith, Y. Brun, P. Devanbu, S. For-rest, and W. Weimer, “The ManyBugs and IntroClass benchmarks for automated repair of C programs,” ACM Transactions on Software Engineering, vol. 41, no. 12, pp. 1236–1256, 2015.

Westley Weimer received the BA degree in computer science and mathematics from Cornell University and the MS and PhD degrees from the University of California, Berkeley. He is currently a professor at the University of Michigan. His main research interests include static and dy-namic analyses to improve software quality and fix defects.

Stephanie Forrest received the BA degree from St. Johns College and the MS and PhD de-grees from the University of Michigan. She is currently at Arizona State University, where she directs the Center for Biocomputation, Security and Society and is Professor of Computing, In-formatics, and Decision Systems Engineering. Her research interests include biological mod-eling, evolutionary computation, software engi-neering and computer security. She is a fellow of