Geekbench versus SPEC
In August, we published a blog comparing the power-performance characteristics of NUVIA’s CPU design against shipping CPUs from Intel, AMD, ARM (Qualcomm), and Apple. That blog highlighted the importance of achieving maximum performance within a limited power envelope in the general-purpose server market. It also introduced NUVIA’s first-generation CPU, codenamed Phoenix, and its projected performance and power based on internal simulations. The power-performance data presented in that blog were based on Geekbench 5, and the blog explained the rationale for this choice as follows:
“We believe Geekbench 5 is a good starting point, as it consists of a series of modern real-world kernels that include both integer and floating-point workloads. It runs on multiple platforms, including Linux, iOS, Android, and Windows. It also gives us the ability to conduct these tests on commercially available products. … You may be wondering how we can make the extrapolation from smartphone and client CPU cores to a server core. In our view, there is no meaningful difference. If anything, these environments now deliver incredibly similar characteristics in how they operate.”
Geekbench 5’s execution is well contained within the CPU complex, making the idle-normalized power measurement technique more closely reflect the actual CPU power.
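The idle-normalized technique amounts to subtracting the system’s idle wall power from its wall power while the benchmark runs, attributing the delta to the CPU complex. A minimal sketch, using made-up readings (the wattages below are illustrative, not measurements from this study):

```python
# Idle-normalized power: wall power under load minus idle wall power.
# The 45 W / 20 W readings below are MADE-UP illustrative numbers.
def idle_normalized_power(load_watts: float, idle_watts: float) -> float:
    """Approximate CPU-complex power as the delta over idle."""
    return load_watts - idle_watts

# e.g., 45 W at the wall while running Geekbench 5, 20 W sitting idle
cpu_power = idle_normalized_power(45.0, 20.0)
print(f"idle-normalized CPU power: {cpu_power:.1f} W")  # prints 25.0 W
```

The approximation works best when the workload keeps activity inside the CPU complex, as Geekbench 5 largely does; heavy disk or network activity would pollute the delta.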
After we posted this blog, we saw some industry discussion arguing that Geekbench 5, while a popular CPU benchmark in the client space (laptop, phone, tablet), is not relevant in the server CPU space, and that SPEC CPU2006 and CPU2017 should be used instead. In this blog, we explore that proposition.
What is an ideal benchmark? Answer: one that is most representative of the customer’s workload. However, quantifying representativeness is challenging (and often controversial) due to the richness and diversity of the workloads that customers of general-purpose CPUs run. Against this nebulous backdrop, SPEC CPU2006 and later the CPU2017 benchmark suite have emerged as the de facto standard for measuring server performance. These suites contain a variety of tests (from compilers to AI to weather forecasting) that exercise various aspects of the CPU and the memory hierarchy, and they measure both CPU speed and throughput. Several other server benchmarks, such as TPC-C, SPECjbb, and PerfKitBenchmarker, cover areas where SPEC CPU is deficient (JITed code, data-sharing workloads, etc.).
So, where does Geekbench fit in? Geekbench is a tremendously popular benchmark in the mobile and client space. It is a cross-platform, easy-to-run benchmark that exercises different aspects of the CPU. The question we are trying to explore is whether benchmarking CPU performance using Geekbench leads us to a different conclusion about the relative speeds of different CPUs than one might arrive at from using SPEC CPU. The simple answer is no.
Our hypothesis is that the performance of different benchmark suites will correlate well with each other so long as the suites comprise a diverse set of tests that exercise the different aspects of the micro-architecture, utilize the same instruction-set features, and are of similar type (integer, floating-point, database, etc.). We believe this to be the case for Geekbench and SPEC CPU. To test this hypothesis, we measured the INT single-core (or Rate x 1) and multi-core (or Rate) SPEC CPU2006, CPU2017, and Geekbench 5 performance for the systems shown in Table 1.
| System | ISA | Type | uArch | #CPUs / #threads | CPU Base / Turbo Freq (GHz) | System’s peak DRAM BW (GB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| AMD Ryzen 9 3900X | x86_64 | Client | Zen2 | 12 / 24 | 3.8 / 4.6 | 42.6 |
| Intel Core i9-9900K | x86_64 | Client | CoffeeLake | 8 / 16 | 3.6 / 5.0 | 42.6 |
| Marvell Thunder X2 | aarch64 | Server | Vulcan | 28 / 112 | 2.0 / 2.5 | 170.6 |
| Ampere eMAG | aarch64 | Server | SkyLark | 32 / 32 | 3.3 / 3.3 | 170.6 |
| Amazon Graviton2 | aarch64 | Server | N1 | 64 / 64 | 2.5 / 2.5 | 204.8 |
| AMD EPYC 7702 | x86_64 | Server | Zen2 | 64 / 128 | 2.0 / 3.35 | 170.6 |
| Intel CascadeLake 8280 | x86_64 | Server | CascadeLake | 28 / 56 | 2.7 / 4.0 | 140.8 |
These systems span a broad spectrum of features: ISAs (x86, ARM), CPU microarchitectures, system architectures, and speeds-and-feeds. All SPEC CPU binaries used in this study were compiled using clang10 / gfortran10 with O3, PGO, LTO, and machine-specific optimizations (and without custom heap allocators). The Geekbench binaries used were purchased from Primate Labs and were run as delivered, with no changes. All tests were run under Ubuntu 20.04 Linux. The measured Geekbench INT performance and SPEC CPU2006 and CPU2017 INT performance are shown in Table 2.
These measurements are also plotted as scatter charts, where the x-axis shows the Geekbench score and the y-axis shows the corresponding SPEC CPU score. Figure 1 shows this correlation for a single-core (or single-thread / ST / 1T) run and Figure 2 shows the same for a multi-core (or multi-threaded / MT) run. In both cases we see a near-perfect linear correlation, with R² > 0.99.
In fact, the correlation is so good that we use it for both predictive and diagnostic purposes. To demonstrate the predictive capability, Table 3 shows measured Geekbench 5 and CPU2006 single-threaded INT scores sourced from the web for Intel’s 1065G7 (Sunnycove / Icelake) and Apple’s A12 and A13. The table also shows the predicted CPU2006 INT ST scores using the linear equation in Figure 1; these are within 1% of measurements. Based on past studies, we generally find that the measurements and predictions are within 5% of each other. We can take this further and predict the CPU2006 INT ST score for the A14 chip in Apple’s unreleased iPhone 12 using its Geekbench 5 score. It will be interesting to compare this prediction against measurements when review websites publish scores.
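The fit-and-predict procedure behind Figure 1 and Table 3 is an ordinary least-squares line plus an R² check. The sketch below uses hypothetical (Geekbench 5 INT, CPU2006 INT) pairs as placeholders, not our measured Table 2 data; substitute real scores to reproduce the actual equation:

```python
# Sketch of the correlation-based fit and prediction described above.
# The (gb5, spec) pairs are HYPOTHETICAL placeholders, not Table 2 data.
gb5 = [700.0, 900.0, 1100.0, 1300.0, 1500.0]   # Geekbench 5 INT ST scores
spec = [28.0, 36.5, 44.5, 53.0, 61.0]          # SPEC CPU2006 INT ST scores

n = len(gb5)
mean_x = sum(gb5) / n
mean_y = sum(spec) / n

# Ordinary least-squares fit: spec ~= m * gb5 + b
sxx = sum((x - mean_x) ** 2 for x in gb5)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gb5, spec))
m = sxy / sxx
b = mean_y - m * mean_x

# Coefficient of determination (R^2)
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(gb5, spec))
ss_tot = sum((y - mean_y) ** 2 for y in spec)
r2 = 1.0 - ss_res / ss_tot

# Predict a new system's CPU2006 INT score from its Geekbench 5 INT score
predicted = m * 1200.0 + b
print(f"spec = {m:.4f} * gb5 + {b:.3f}, R^2 = {r2:.4f}, pred = {predicted:.1f}")
```

With measured data, an R² above 0.99 (as in Figures 1 and 2) is what justifies using the line to predict SPEC scores for systems where only a Geekbench score is available.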
While this observation is interesting from a benchmarking standpoint, Geekbench is generally less demanding of the micro-architecture than SPEC CPU is. For a subset of the micro-architectural features, Figure 3 shows the relative metric value for CPU2006 and CPU2017 normalized to a baseline of 1.0 for Geekbench 5. These were generated from detailed performance simulations of a modern CPU. It shows that branch mispredicts and data cache (D-Cache) and data TLB (D-TLB) misses are 1.1x to 2x higher in SPEC CPU than in Geekbench 5. For this reason, chip architects tend to study a wide variety of benchmarks, including SPEC CPU and Geekbench (among many others), to optimize the architecture for performance.
It is important to note that the observed correlation is not a fundamental property and can break under several scenarios.
One example is thermal effects. Geekbench typically runs quickly (in minutes), especially so in our testing where the default workload gaps are removed, whereas SPEC CPU typically runs for hours. The net effect is that Geekbench 5 may achieve a higher average frequency because its short runtime lets it exploit the system’s thermal mass, while SPEC CPU’s long runtime means it is governed by the long-term power-dissipation capability of the system. This is something to watch out for when applying such correlation techniques to systems that see significant thermal throttling or power-capping while running these benchmarks.
Another scenario where the correlation can break is a non-linear jump in performance that one benchmark suite sees but the other does not. The interplay between the active data footprint of a test and the CPU caches is a classic source of such non-linearities. For example, a future CPU’s cache may be large enough that many sub-tests of one benchmark suite fit entirely in cache, boosting their performance many-fold, while the other suite sees no such benefit because none of its tests fit. In such cases, the correlation will not hold.
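A toy model makes the cache effect concrete. All of the footprints, the 16 MB cache size, and the 3x in-cache speedup below are hypothetical numbers chosen purely for illustration; the point is that one suite’s geometric-mean score jumps while the other’s does not, breaking any linear relationship between them:

```python
# Toy model of the cache-footprint non-linearity described above.
# Footprints, cache size, and the 3x in-cache boost are HYPOTHETICAL.
def suite_speedup(footprints_mb, cache_mb, in_cache_boost=3.0):
    """Geometric-mean speedup of a suite on a larger-cache CPU: tests whose
    working set fits in cache speed up by in_cache_boost, others by 1.0x."""
    speedups = [in_cache_boost if f <= cache_mb else 1.0 for f in footprints_mb]
    product = 1.0
    for s in speedups:
        product *= s
    return product ** (1.0 / len(speedups))

suite_a = [4, 8, 12, 40]      # MB; three of four tests fit a 16 MB cache
suite_b = [64, 96, 128, 256]  # MB; none fit

print(f"suite A speedup: {suite_speedup(suite_a, cache_mb=16):.2f}x")  # ~2.28x
print(f"suite B speedup: {suite_speedup(suite_b, cache_mb=16):.2f}x")  # 1.00x
```

Suite A’s score jumps while suite B’s is unchanged, so a line fit on older systems would badly mispredict one suite from the other on the new CPU.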
Finally, we would like to end this blog by showing (in Figure 4) the Geekbench 5 power-performance chart from our earlier blog. Please note that the Geekbench 5 scores shown in this chart are the overall single-core scores, which are a weighted average of the Crypto, INT, and FP Geekbench 5 components. Using the equation in Figure 1 and the measured Geekbench 5 INT score for each system (not shown in the chart), we can calculate the respective CPU2006 INT and CPU2017 INT single-core scores. The chart is annotated with these calculated scores. Since NUVIA has not disclosed the Phoenix CPU’s Geekbench INT score, the formula cannot be applied to it directly. However, one can estimate the SPEC CPU INT score that Phoenix achieves from the scores of the other systems in the chart.
We wish to thank the members of the WCA team (Kun Woo Lee, Christian Teixeira, Sriram Dixit, Sushmitha Ayyanar, Sid Kothari) whose contributions made this blog possible. We also wish to thank John Bruno, Gerard Williams, Amit Kumar and Jon Carvill for their suggestions on this writeup.