A Complete Formal Semantics of x86-64 User-Level Instruction Set Architecture

Sandeep Dasgupta  
University of Illinois at Urbana  
Champaign, USA  
sdasgup3@illinois.edu

Daejun Park  
University of Illinois at Urbana  
Champaign, USA  
dpark69@illinois.edu

Theodoros Kasampalis  
University of Illinois at Urbana  
Champaign, USA  
kasampa2@illinois.edu

Vikram S. Adve  
University of Illinois at Urbana  
Champaign, USA  
vadve@illinois.edu

Grigore Roşu  
University of Illinois at Urbana  
Champaign, USA  
grosu@illinois.edu

Abstract

We present the most complete and thoroughly tested formal semantics of x86-64 to date. Our semantics faithfully formalizes all the non-deprecated, sequential user-level instructions of the x86-64 Haswell instruction set architecture. This totals 3155 instruction variants, corresponding to 774 mnemonics. The semantics is fully executable and has been tested against more than 7,000 instruction-level test cases and the GCC torture test suite. This extensive testing paid off, revealing bugs in both the x86-64 reference manual and other existing semantics. We also illustrate potential applications of our semantics in different formal analyses, and discuss how it can be useful for processor verification.

CCS Concepts

• Theory of computation → Program reasoning; • Hardware → Simulation and emulation; • Software and its engineering → Formal language definitions.

Keywords  
x86-64, ISA specification, Formal Semantics

ACM Reference Format:

1 Introduction

The x86-64 instruction set architecture (ISA) is one of the most complex and widely used ISAs on servers and desktops, and ensuring the correctness of the x86-64 binary code is important. The ability to directly reason about the binary code is desirable, not only because it allows to analyze the binary even when the source code is not available (e.g., legacy code or malware), but also because it avoids the need to trust the correctness of compilers [22, 62].

A formal semantics of x86-64 is required for formal reasoning about binary code, one of the strongest ways to ensure its correctness. An executable semantics is especially powerful because it allows direct testing to gain confidence in the definitions of the semantics, and also because it can allow additional tools based on symbolic execution, like deductive verification and symbolic test generation. Completely formalizing the semantics of x86-64, however, is challenging especially due to the complexity and the sheer number of instructions that are informally specified in approximately 3,000-page standard [11].

Existing Semantics for x86-64  
To date, to the best of our knowledge, despite several explicit attempts [32, 33, 37] and other related systems [13, 35, 36, 40, 41], no complete formal semantics of x86-64 exists that can be used for formal reasoning about x86 binary programs. Heule et al. [37], presented a formal semantics of x86-64, but it covers only a fragment (~47%) of all instructions; as the authors of [37] candidly admitted, their synthesis methodology proved insufficient to add the remaining instructions primarily due to limitations of the underlying synthesis engine. Moreover, their semantics misses certain essential details (Section 3.6 & 4).

Goel et al. [32, 33,], on the other hand, specified a formal semantics in the ACL2 proof assistant [38], allowing to reason about functional correctness, but their semantics covers only a small fragment (~33%) of all user-level instructions. There also have been several attempts [20, 24, 35, 59] to indirectly describe the x86-64 semantics, where they define an intermediate language (IL), specify the IL semantics, and translate x86-64 to the IL. This indirect semantics, however,
may not be general enough to be used for different types of formal analyses, since the IL might be designed with specific purposes in mind, not to mention that the translation may miss certain important details of the instruction behaviors. Refer to Section 7 for a more detailed comparison to existing semantics.

Our Approach We present the most complete and thoroughly tested formal semantics of user-level x86-64 assembly instructions to date. We employed the X framework (Section 2.1) as our formalism medium to leverage its capability of deriving various correct-by-construction formal analysis tools directly from the language semantics. We took Heule et al.’s semantics (Section 2.2) as our starting point to avoid duplicating the formalization effort. We made several corrections or improvements to this semantics, to improve both soundness and efficiency. We automatically translated their semantics into X, and cross-checked the translated semantics against the original using an SMT solver. We manually specified the semantics of the remaining instructions faithfully consulting the Intel manual [11] to obtain the complete semantics. A manual specification may sound like a daunting effort at first, but the fact that (1) x86-64 is largely stable and changes slowly over time, and (2) the state-of-the-art synthesis techniques for language semantics (notably, Strata [37] and Hasabnis et al. [35, 36]) suffer from scalability and/or faithfulness issues (see Section 3.2 and 7 for details), makes the effort worth undertaking.

Like closely related previous work [33, 37], we omit the relaxed memory model of x86-64 and thus the concurrency-related operations. Modelling concurrency is a complex but relatively orthogonal problem in the presence of small-step operational semantics, as shown in prior work [47, 56], where they have integrated their memory model with a small subset of 32-bit x86 instruction set. We believe that integrating such a memory model into our instruction semantics is a promising direction toward rigorously reasoning about real-world programs running on modern multiprocessors. We leave it for future work.

Contributions Below are our primary contributions.

Completeness We present the most complete formal semantics of x86-64 to date. Specifically, our semantics formalizes all the user-level instructions of the x86-64 Haswell ISA (that is, 3155 instructions covering 774 mnemonics [11]), except deprecated ones (336 instructions), the AES cryptography extensions (35 instructions), and the system & concurrency-specific instructions (210 instructions) (Section 3.1).

Faithfulness Being executable, the semantics of each instruction has been thoroughly tested against 7,000+ test cases

1 The current work do not include a formal model of the binary instruction decoder. Note that, all future references of x86-64 “program(s)” or “instructions(s)”, in the context of our model, are meant to refer to the “assembly language programs(s)” or “assembly instruction(s)”.

using the co-simulation method (Section 4). We found errors in both the x86-64 standard document and other existing semantics including the baseline semantics (Section 4).

Usability & portability We illustrate the potential of our semantics to be used for formal analyses such as deductive program verification and program equivalence checking (Section 5). The X framework also enables one to represent our semantics as SMT theories, which allows others to easily reuse our semantics for their own purposes. Indeed, we have translated our semantics to Stoke [27] which can serve as a drop-in replacement of Heule et al.’s semantics [37] and can immediately benefit tools built on Stoke (e.g., [53]).

Our formal semantics is publicly available at [26].

1.1 Challenges in Formalizing x86-64

Size and Complexity The x86-64 ISA has a large number of instructions, partly because of a large number of complex instructions and partly because it keeps most of the legacy and deprecated instructions (∼ 336+) for the sake of backwards compatibility. It consists of 996 mnemonics, and each mnemonic admits several variants, depending on the types (i.e., register, memory, or constant) and the size (i.e., the bit-width) of operands.

Inconsistent Instruction Variants Some variants have divergent behaviors more than the difference of their type and size. For example, movsd, one of the 128-bit SSE instructions, has very different behaviors depending on whether the type of the source operand is register or memory; it clears the higher 64 bits of the target register only when the source type is memory. Indeed, we revealed bugs in other semantics due to their incorrect generalization of the variants’ behavior (Details in Section 3.6, Instruction Variants).

Ambiguous Documentation The x86-64 reference manual informally explains the instruction behaviors, leaving certain details unspecified or ambiguous, which required us to consult with an actual processor implementation to clarify such details. Completely formalizing the vast number of instructions with carefully identifying all the corner cases from the informal document, thus, is highly non-trivial.

Undefined Behaviors The x86-64 standard also admits undefined behaviors that are implementation-dependent. Many instructions (32 out of 996 mnemonics) have undefined behaviors: their output values of the destination register or the %rflags register are undefined in certain cases. That is, the processor is free to choose any behavior in undefined cases. Many existing semantics, however, simply “define” the undefined behaviors by following a specific behavior taken by a processor implementation. This approach is problematic because they do not capture all possible behaviors of

2 These numbers are obtained by parsing the official manual “Volume 2: Instruction Set Reference” and cross checked with projects [16, 57] investing similar efforts.
different processor implementations. Indeed, we found discrepancies between existing semantics in specifying the undefined behaviors, where different semantics are valid only for different groups of processors. That is, such semantics are not adequate to formally reason about universal properties (e.g., portability) of a program that need to be satisfied for all standard-conforming processors. For example, the parity flag %p is undefined in the logical-and-not instruction andn, where the processor implementation is allowed to either update the flag value (to 0 or 1), or keep the previous value. We found, e.g., that Remill [13] updates the flag with 0, whereas Radare [20] keeps it unmodified. Identifying and faithfully specifying all of the undefined behaviors, thus, are desirable but challenging.

In our semantics, we faithfully modeled the undefined value as a unique symbol (called undef) whose value is non-deterministically decided each time within the proper range. These nondeterministic values are enough to capture and formally reason about all possible behaviors of the instructions for different processors (and even any future, standard-conforming processor).

2 Preliminaries
Here we provide background on the \( K \) framework and the Strata project [37] (used for our baseline semantics).

2.1 \( K \) Framework

\( K \) [58] is a framework for defining formal language semantics. Given a syntax and a semantics of a language, \( K \) generates a parser, an interpreter, as well as formal analysis tools such as model checkers and deductive program verifiers, at no additional cost. Using the interpreter, one can test their semantics immediately, which significantly increases the efficiency of semantics developments. Furthermore, the formal analysis tools facilitate formal reasoning about the given language semantics. This helps in terms of both applicability of the semantics and engineering the semantics itself.

We refer the reader to [54, 58] for details. In a nutshell, in \( K \), a language syntax is given using conventional Backus-Naur Form (BNF). A language semantics is given as a parametric transition system, specifically a set of reduction rules over configurations. A configuration is an algebraic representation of the program code and state. Intuitively, it is a tuple whose elements (called cells) are labeled and possibly nested. Each cell represents a semantic component, such as the memory or the registers. A special cell, named \( k \), contains a list of computations to be executed. A computation is essentially a program fragment, while the original program is flattened into a sequence of computations. A rule describes a one-step transition between configurations, giving semantics to language constructs. Rules are modular; they mention only relevant cells that are needed in each rule, making many rules far more concise and easy to read than in some other formalisms.

2.2 Strata Project
Strata [37] automatically synthesized formal semantics of 1905 instruction variants (representing 466 unique mnemonics) of the x86-64 Haswell ISA. The algorithm to learn the formal semantics of an instruction, say IS, starts with a small set of instructions, called base set B, whose semantics are known and trusted; a set of test inputs T, and the output behavior of IS obtained by executing IS on T. Then Stoke [57] is used to synthesize instruction sequences which contain instructions from B and match the behavior of IS for all test cases in T. Given two such generated instruction sequences IS and IS’, their equivalence is decided using an SMT solver and the trusted and known semantics from the base set. If the two sequences are semantically distinct, then the model produced by the SMT solver is used to obtain an input t that distinguishes IS and IS’, and t is added to T. This process of synthesizing instruction sequence candidates and accepting or rejecting them based on equivalence checking with previous candidates, is repeated until a threshold is reached, which in their implementation is based on the number of accepted instruction sequences.

For each instruction, Strata manifested its semantics in terms of two related artifacts. The first artifact is an instruction sequence and the second is a set of SMT formulas in the bit-vector theory, one for each output register. The second is obtained by symbolically executing the first.

3 Formalization of x86-64 Semantics

This section presents how we get the complete semantics of all the user-level instructions. Section 3.1 details the scope of our work. Section 3.6 mentions how we leverage the information available in Strata, our baseline semantics. Section 3.3 explains how we formalize our model in \( K \).

3.1 Scope of the Work
We support all but a few non-deprecated user-level instructions. The support includes 3155 total variants of the Haswell x86-64 ISA (representing 774 out of 996 unique mnemonics). The entire implementation took 8 man-months, with the lead author having prior experience in binary decompilation and strong familiarity with the x86-64 architecture and documentation. Below is a summary of the instruction categories that we support.

General-Purpose. These implement data-motion, arithmetic, logic, control-flow, string operations (including fast- and repeated-string operations).

Streaming SIMD Extensions (SSE) & subsequent extensions (SSE-2, SSE-3, SSE-4.1, SSE-4.2). Instructions in this category operate on integer, string or floating-point values stored in 128-bit xmm registers. Among other things, the category
features instructions related to conversions between integer and floating-point values with selectable rounding mode, and string processing.

Advanced Vector Extensions (AVX) & subsequent extensions (Fused-Multiply-Add (FMA) & AVX2). These instructions operate on integer or floating point values stored in 256-bit ymm registers; a majority of which are promoted from SSE instruction sets. Additionally, the category features enhanced functionalities specific to AVX & AVX2, like broadcast/permute, vector shift, and non-continuous data fetch operations on data elements.

16-bit Floating Point Conversion (or F16C). These instructions implement conversions between single-precision (32-bit) and half-precision (16-bit) floating-point values.

Instructions which are not included in the current scope of work are: (1) System-level instructions, which are related to the operating system, protection levels, I/O, cache lines, and other supervisor instructions; (2) x87 & MMX instructions, consisting of legacy floating-point and vector operations, respectively, which are now superseded by SSE; (3) Concurrency-related operations, including atomic operations and fences; and (4) Cryptography instructions, which support cryptographic processing specified by Advanced Encryption Standard (AES). We note that while there is no inherent limitation in supporting the above instructions with our approach, the system-level instructions require to formulate an abstraction of different architectures and operating systems, which is a significant effort that is orthogonal to the presented effort of formalizing the user-level instructions.

3.2 Overview of the Approach

Briefly, our approach is as follows. We first defined the machine configuration and underlying infrastructure in the $\mathbb{K}$ framework, in order to define, execute and test the x86-64 semantics. To leverage previous work as much as possible, we took the semantic rules of all the instructions supported in Strata, which amounts to about 60% of the instructions in scope, in the form of SMT formulas. We corrected, improved or simplified many of the baseline rules. We then translated these SMT formulas from Strata into $\mathbb{K}$ rules using a script, and tested the resulting rules by comparing with the Strata rules using Z3. These steps give us a validated initial set of semantic rules in $\mathbb{K}$ for about 60% of the target instructions (our "baseline" set).

We attempted to extend the stratification approach in Strata to define additional rules automatically, in two ways: (i) augmenting their base set $B$, and (ii) constraining the search space manually using knowledge of instruction behaviors. Both these attempts failed; they worked only for a few instructions, but in general, we found them to be impractical. Specifically, we added 58 base instructions to the base set, and learned the semantics of 70 new instructions, which are variants of the added instructions, in 20 minutes, but no more even after we kept running for two days. Also, we tried constraining the search space by manually populating it with relevant instructions. The lesson we learned from these experiments is, getting the right set of base instructions or a constrained search space for a complex instruction need an insight about the semantics of that instruction itself. We found that the effort to extract such information from the manual is about the same as manually defining that instruction.

We then manually added $\mathbb{K}$ rules for the remaining 40% of the target instructions by systematically translating their description of the Intel manual into $\mathbb{K}$ rules, in some cases cross-referencing against semantics available in Stoke. The outcome was a complete formal specification of user-level x86-64 in $\mathbb{K}$.

We validated this semantics in three ways, as described in Section 4. First, we use the $\mathbb{K}$ interpreter to execute the semantics of each instruction for 7,000+ test inputs (each input is a processor state configuration) and compared the output directly with the hardware behavior for the same instruction. Second, we repeated this experiment using the applicable programs in the GCC C-torture tests [7]. Third, we compared against the semantics defined in the Stoke project for about 330 instructions that were omitted in Strata (thus not included in our baseline), using an SMT solver.

These validation experiments uncovered bugs in the Intel manual, in Strata’s simplification rules, and in the Stoke semantics. All these bugs were reported to the authors, and most have been acknowledged and some have been fixed. The details are in Section 4.

3.3 Program Configuration

Defining a language semantics in $\mathbb{K}$ requires defining the program configuration, the semantics of how programs are evaluated (i.e., the execution environment), and the semantics of the statements or instructions. We begin with the configuration.

The $\mathbb{K}$ configuration of a running x86-64 program is shown in Figure 1. The cells are represented using angle brackets. The outer $\top$ cell contains the cells used during program evaluation. The inner $k$ cell contains a list of computations to be executed. Below we describe the two other inner cells.\footnote{We omit other auxiliary cells (marked by “…” for the simplicity of the presentation.}

Figure 1. Program Configuration

Register State. The regstate cell contains a map from registers or flag names to values. Note that, all the values or addresses, stored in registers, memory (described next) or flags, are represented as bit-vectors which are depicted in this paper as W$^\top$V, and interpreted as a bit-vector of size W and value V. The register names include the sixteen general
x86-64 Semantics

purpose registers, %rip, and the sixteen SIMD registers. The value mapped to a register name is a 64-width bit-vector (or a 256-width one for the SIMD registers). Values for sub-register variants are derived from the register values by extracting the relevant bits. We store individual flag names (mapped to a bit-vector value of width 1) as opposed to a 64-bit %flags register. Every access (read/write) of %flags retrieves the entries in the regstate map for the individual flags.

Memory State. Our memory model is inspired by previous efforts [29, 41]. The memstate cell is a map from 64-bit addresses to bytes, which specifies the byte-addressable memory4, but our implementation is flexible enough to use alternative memory representations with addressing of 2-byte or 4-byte quantities. Our memory layout is “flat”, in which all available memory locations can be addressed, but we do have logical partitions5 of the memory into sections like code, data and stack. The following is an example snapshot of a memory state, holding a 4-bytes integer value 65535:

$$\begin{align*}
4 & \mapsto \text{byte(0, 32'65535)} \\
5 & \mapsto \text{byte(1, 32'65535)} \\
6 & \mapsto \text{byte(2, 32'65535)} \\
7 & \mapsto \text{byte(3, 32'65535)}
\end{align*}$$

Here the memory address 4 stores the 0th byte of the bit-vector 32'65535, the address 5 stores the 1st byte, and so on. When memory is read, requested bytes are aggregated according to the size of the memory access.

3.4 Semantics of Execution Environment

We now give the reader a flavor of our semantics, by discussing a few of the roughly 5, 200 rules6 that we defined to model the entire semantics. We first explain the semantics of the execution environment, which involves all the machinery used for executing x86-64 programs. We will explain the semantics of individual instructions in the next section.

The execution of an x86-64 program begins with initialization of the configuration with the following contents of the k cell.

$$\langle \text{k} \rangle \mathbf{SPGM}: \text{Instructions} \leftarrow \text{fetch}$$

The symbol $\leftarrow$ is used to separate the computations in the k cell and “;” to represent the type of a term.

Concisely, the semantics of execution of an x86-64 program involves initializing the memory by reading the program instructions ($\mathbf{SPGM}$), from the k cell, one at a time until all the instructions are loaded in memory. The memory-loaded instructions are then fetched one at a time, using the fetch computation, to get executed. The instruction to be executed next is pointed to by the instruction pointer register %rip.

Next, we describe the rule applied to initialize memory with instructions one at a time.

$$\text{rule } \langle k \rangle \mathbf{OpC}: \text{Mnemonic OpR: Operands} \Rightarrow \ldots \langle k \rangle$$

Byte-addressability allows the model to specify both aligned and unaligned accesses in the same principle.

3.5 Semantics of Individual Instructions

Here we explain how we define the semantics of an instruction in K using a running example of logical-and-not andnq ~4(rsp), %rbx, %rax, which performs a bitwise logical AND of inverted source register operand (%rbx) with the source memory operand (~4(rsp)) and writes the result to destination register %rax. Additionally, the instruction affects all the 6 status flags (%sf, %zf, %of, %cf, %af and %pf).

The semantics of most of the instructions can be modeled broadly in 3 phases: (1) read the data from source operand(s), which could be a register, memory or constant value; (2) operate on the data based on the mnemonic; and (3) write the result(s) to destination operand(s), which could be a register or memory. An instruction may exercise some or all of the above phases.

7The nextloc cell is a auxiliary cell that holds the next memory location to store an instruction, which we omit in the program configuration (Figure 1) for the sake of simplicity.

8While initializing the stack section of memory, we store an invalid address just before the entry-point function as return address. When the entry point function returned, the invalid return address is popped out of the stack and stored in %rip leading to program termination.
Read from Source Operand(s)  Instruction in the running example reads from register (%rbx) and memory (-4(%rsp)) operands. A read from register is modeled as a lookup with register name in the regstate map and subsequent read of the mapped value or, for a sub-register, a portion of it. The semantics of register read can be defined as:

\[
\text{rule } \langle k \rangle \text{ getRegisterVal}(R:R64) \Rightarrow BV_i ... \langle k \rangle
\]

\[
\langle \text{regstate} \rangle \Rightarrow R \Rightarrow BV_i ... \langle \text{regstate} \rangle
\]

In the context of the running example, this rule is applied when the current computation (at top of the k cell) is a 64-bit register lookup, appeared as getRegisterVal(%rbx), and regstate contains a register with name "RBX". This rule resolves the register lookup to the mapped bit-vector value BV_i (or BV_{RBX} for the running example).

A read from memory involves computing the effective address in the memory, looking-up address in memory, and reading requested bytes from memory if the memory access is within allowed range. The following rule is applied to compute the effective address:

\[
\text{rule } \langle k \rangle (\text{Offset}:\text{Int}(R:R64)) : \text{Mem} \Rightarrow (64:\text{Offset} + BV_i) : \text{Address}
\]

\[
\langle \text{regstate} \rangle \Rightarrow ... R \Rightarrow BV_i ... \langle \text{regstate} \rangle
\]

The term to the left of \( \Rightarrow \) shows the memory addressing expression, of type Mem, at the top of k cell, which gets reduced to an effective memory address (or EA). The EA for the memory operand used in the running example is (64'-4 + BV_{rbp}) and is used to do memory read access. The rule for memory read access is responsible to read a memory value of requested number of bits (64-bits for the current example) starting from the EA.

Operate on Data  The rules for operating on operands will be different for each instruction based on the mnemonic. For example, the mnemonic andnq requires logical-and-not operation to be computed on the operands.

Write to Destination Operand(s)  The example instruction writes the result to a destination register %rax. Also, the flags sf and zf are updated based on the result; of and cf are cleared, and af and pf are undefined. The rule shown in Figure 2 realizes the destination write operation, where memVal_{64} and BV_i represents the 64-bit data values evaluated using the respective rules for reading register and memory operands (mentioned above). The operator "[i:j]" extracts bits i down to j from a bit-vector of size n, yielding another bit-vector of size i - j + 1, assuming that n > i ≥ j ≥ 0. The operator "&" implements bit-wise "and" operation. The rule associated with memory write is similar to that for memory read and is skipped here.

A x86-64 program is modeled as a list of instructions and its semantics is given by composing the semantics of its constituents.

Figure 2. Example instruction semantics of andn

3.6 Constructing the x86-64 Semantics

Systematic Translation of Strata Rules to K  As mentioned in the introduction, we leverage the Strata [37] semantics to develop our complete semantics, to minimize the overall effort. We systematically translated their semantics into \( \mathbb{K} \). Specifically, Strata offers the semantics of 1905 instruction variants as SMT formulas specifying the behavior of output registers. For each instruction, we converted the SMT formulas that Strata provides to a \( \mathbb{K} \) specification using a simple script (~500 LOC).

To validate the translation, we generated SMT formulas from the translated \( \mathbb{K} \) specifications (using APIs provided by the \( \mathbb{K} \) framework), and use the Z3 SMT solver to check their equivalence to the corresponding formulas provided by Strata. While translating and validating their semantics, we found various issues that we had to fix to establish our baseline semantics. Below we describe the issues we found in Strata.

Status Flags  We found that Strata omitted to specify the %af flag behaviors, as the flag is not commonly used. However, we faithfully specified the semantics of all the status flags in the %rflags register, even if some of them are not commonly used, since they may affect the overall program’s behavior in some tricky cases, and we do not want to miss any of such details when formally reasoning about the x86-64 programs.

Instruction Variants  Strata essentially provides the semantics of the register instructions, assuming that the semantics of the memory and immediate instruction variants can be obtained by generalizing the register instructions. However, we found that certain memory instructions cannot be inferred by simply generalizing their corresponding register instructions. For example, for movsd, one of the 128-bit SSE instructions, its register variant has quite different semantics from the memory variant. Below are their pseudo-code semantics:

\[ \text{exec (andq memVal}_{64}, BV_i, R:R64) \Rightarrow . \]

\[ \ldots \langle k \rangle \]

\[ \langle \text{regstate} \rangle \]

\[ \text{"R" } \Rightarrow (\sim \text{BV}_i \& \text{MemVal}_{64}) \]

\[ \text{"SF" } \Rightarrow (\sim \text{BV}_i \& \text{MemVal}_{64}) [63:63] \]

\[ \text{"ZF" } \Rightarrow (\sim \text{BV}_i \& \text{MemVal}_{64}) = 64'0 ? 1'1:1'0 \]

\[ \text{"OF" } \Rightarrow 1'0 \]

\[ \text{"CF" } \Rightarrow 1'0 \]

\[ \text{"AF" } \Rightarrow \text{undef} // \text{af and} \]

\[ \text{"PF" } \Rightarrow \text{undef} // \text{pf are undefined} . \]

\[ \ldots \langle \text{regstate} \rangle \]

Generalization is based on a hypothesis that the memory or immediate variants will behave identically, on their operands, with corresponding register variant.
As seen, only the memory instruction clears the higher 64 bits of the destination register, which cannot be inferred from the register instruction behavior that does not touch the higher bits at all. We found that another 128-bit SSE instruction, movss, has the same generalization issue. For the other instructions, we obtained the memory and immediate variants by generalizing the register variants, and validated the generalization by co-simulating the inferred semantics against a processor.

**Immediate Instruction Variants** There are 118 immediate instruction variants (over the 8-bit constants) that do not have corresponding register instructions. For those immediate instructions, Strata provides the instruction semantics for each individual constant, resulting in 30,208 (= 118 × 256) formulae for the immediate instructions’ semantics. We generalized the set of formulae for each immediate instruction into a single semantic rule. We validated our generalization by cross-checking the generalized semantics with the original using the SMT solver.

**Formula Simplification** Due to the nature of the stratification, Strata provides complex formulae for certain instructions. We simplified those complex formulae by either applying some simplification rules or manually translating into simpler ones. Then we validated the simplification by checking the equivalence between them using the SMT solver. For example, the original Strata-provided formula for shr x1 %edx, %ecx, %ebx consists of 8971 terms (including the operator symbols), but we could simplify it to a formula consisting of only 7 terms.

## 4 Validation of Semantics

A formal semantics is of limited use if one cannot generate confidence in its correctness. In this section, we describe how we establish that confidence in our model.

### 4.1 Co-Simulations against Hardware

Empowered by the fact that we can directly execute the semantics using the $\mathbb{K}$ framework, we validated our model by co-simulating it against a real machine. During co-simulation, we execute a machine program on the processor as well as on our $\mathbb{K}$ model and compare the execution results after every instruction. In this work, we co-simulated our model against two Intel implementations that were available to the authors at the time of writing: “Intel Xeon CPU E3-1505M v6” and “Intel Xeon CPU E5-2640 v4”. We admit that testing the model against other hardwares (such as AMD) would contribute to more thorough validation of our model, having the potential of revealing flaws in those implementations and/or additional imperfections in the manual as well, which we leave as future work.

We first describe our test-infrastructure and then talk about individual validation experiments and results.

**Test Harness** During co-simulations, we need to make sure that the program must be instrumented similarly both on our model and the real hardware. We use the GNU Debugger [10] to instrument programs on hardware. We developed instrumentation tools based on $\mathbb{K}$ framework to gain similar capabilities for our model. Using these tools we can record the output state (including memory) after the execution of each instruction. To facilitate debugging, in the event when the output states do not match, we developed a tool which points to the first instant when the output states diverge and this saves debugging time.

The co-simulation experiments are done in the following two phases: (1) Instruction level validation: testing individual instructions, and (2) Program level validation: testing combinations of instructions as a part of real-world programs.

**Instruction Level Validation** The goal here is to execute individual instructions both on hardware and our model using test inputs and then compare the output states.

$\mathbb{K}$ already has matured library support for bit-vector, integer and floating point theories. We use bit-vectors to implement the values stored in registers or memory. Depending upon an instruction mnemonic, these values can be interpreted as integers (signed/unsigned) or floating point values (with various precisions). We augmented the library support in $\mathbb{K}$ framework to interpret these bit-vectors accordingly. With that support, we can execute and hence test instructions implementing various floating point operations including conversions (to and from integer/floating point values) with selectable rounding modes (e.g. Nearest, +Inf, -Inf and Truncate).

**Test Inputs** A test input is a CPU state which includes values for all registers, flags and memory. Our test input set contains more than 7,000 inputs, obtained from the following sources: (1) In section 2.2, we mentioned that Strata starts its algorithm with a set of test inputs which keeps on augmenting itself during the process of stratification. We used the final augmented test-suite of 6630 test inputs, (2) While testing instructions implementing floating point operations, we found that many of the test inputs are representing a NaN or Infinity and it does not makes sense to test with many instances of these. We did our best effort by manually generating more than 100 unique floating point values by consulting the IEEE floating point arithmetic standard [1], (3) We used the (~100) test-inputs offered by Remill [13], and (4) We manually implemented a regression test-suite worth of around 200 test-inputs which we accumulated over the course of the project.

---

\[\text{Indeed, Strata explicitly provides only 19,783 formulae by randomly sampling ~168 constants out of 256, in average, for each immediate instruction, assuming that the remaining 10,425 formulae can be inferred.}\]
Note that, each instruction semantics consists of one or more semantic rules, where those rules cover different cases of the instruction behaviors (including the undefined ones). We ensure that our test inputs are sufficient enough to trigger all of the semantic rules, achieving the full “semantic-rule” coverage.

Results  Our current implementation of the fused-multiply-add operation\textsuperscript{11} incorrectly rounds the operation twice (after multiplication and addition) as opposed to once. As a result, we encountered floating point precision issues while testing instructions implementing those operations (\textit{vfmad132pd}). This is a limitation of the underlying \texttt{K} library and more details about this limitation can be found at Section 6.

While performing the validation tests, we encountered various cases where the output state obtained by executing the semantics on our model does not agree with that of the hardware execution. The instruction semantics in our model is either based on the Strata project (for the part we borrowed) or on the Intel manual. A difference in the output state could mean a bug in Strata’s instruction semantics or in our interpretation of the Intel manual or in the Intel manual itself. We found many bugs in our interpretation which we fixed, but in other cases, we found issues in Intel manual and Strata project.

Inconsistencies Found in the Intel Manual  Here are inconsistencies found during development and testing. According to the manual, the semantics of \texttt{vpsravd \%xmm3, \%xmm2, \%xmm1} seems to depend on the lower 100 bits of \%xmm3, whereas the actual hardware execution suggests that it should depend on the lower 128 bits. Similar inconsistencies are found in instructions with mnemonics \texttt{vpsllvd, vpsllvq, vpsravd}. Also, we found misleading typos related to instructions with opcodes \texttt{vpsravw, vpsravd, vpsravq, packsswb}. All these findings were reported and acknowledged by Intel as issues in the manual \textsuperscript{4}.

Inconsistencies Found in Strata’s Simplification Rules  While testing the instructions specifications borrowed from Strata, we found inconsistent behaviors with the actual hardware. Moreover, the inconsistencies were discovered in the formulas of floating point instructions. This is not surprising because Strata models the floating point instructions as un-interpreted functions which cannot be executed or tested on hardware. Their semantics are executable in our definition though, and thus we were able to test them thoroughly. Note that Strata generates the formulas for these instructions by symbolically executing the corresponding learned instruction sequences followed by a formula simplification pass. Therefore, errors in those formulas can be due to bugs either in the symbolic execution engine or in the simplification stage. Our testing shows that the second is true with the following evidence. The simplification rule add\_double(A, 0) == A does not hold for A = −0.0. Same for add\_single. These were reported \textsuperscript{8}. Also, the simplification rule sub\_double(A, A) == 0 does not hold for A = NaN. Same is true for sub\_single. We found this bug in the branch of Stoke which is used in Strata. But this has been already fixed in the latest Stoke branch.

Program Level Validation  The goal here is to test the combination of instructions as part of real-world programs and we chose to use GCC C-torture tests \textsuperscript{7} for this purpose. Specifically, we used the tests inside the “testsuite/gcc-c-torture/execute” directory for GCC version 8.1.0. There are originally 1576 tests, which we compiled using the GCC switches “-O0 -march=haswell -S -mlong-double-64 -mno-80387”. The last two switches avoid generating x87 instructions that are not in the scope of work. We had to exclude 6 programs containing system-level instruction \textit{prefetchnta}, which require modeling caches, which we currently do not support. Many test-cases involve C-library functions, like malloc, fprintf, most of whose semantics are modeled in \texttt{K}. As our support of C-library functions is not exhaustive, we have to exclude 22 programs containing un-supported functions like vfprintf and vsprintf, which we plan to support in future. This brings us to a grand total of 1548 viable tests, which are all tested. Out of those, we found that there are 293 cases where floating point instructions are used covering 35 unique floating point operations. Moreover, all the test-cases together cover about 963 instruction variants, covering 30% of our supported instructions. As before, we executed each program on the processor as well as on our model and compared the output state after every instruction, which matches in all the cases\textsuperscript{12}.

4.2 Comparing with Stoke  Stoke \textsuperscript{57}\textsuperscript{13} contains manually written semantics for ~1764 x86-64 instruction variants, a large fraction (81%) of which is also supported by Strata. The remaining fraction is exclusive to Stoke. Comparing with Stoke provides an additional crosscheck on our model. Moreover, these manually written formulas are based on a similar model of the CPU state to ours, which makes it easier to compare them against ours by using an SMT solver. While doing so we found inconsistencies between the two formalisms in a total of 16 mnemonics (42 instruction variants), and after careful analysis, identified these as errors in the Stoke specification of instruction semantics, as follows.

\textsuperscript{11}According to the standard IEEE-754-2008 [1] (Definition 2.1.28), the operation fused-multiply-add(x, y, z) computes $x \times y + z$ as if with unbounded range and precision, rounding only once to the destination format.

\textsuperscript{12}Note that none of test-cases include floating point instructions implementing fused-multiply-addition, which we already acknowledged to have precision issues.

\textsuperscript{13}Recall that Stoke is a stochastic super-optimizer leveraged by Strata for stochastic search.
Inconsistencies Found in Stoke  First, for instructions like `addsubpd %xmm1, %xmm2`, the order of addition and subtraction specified by Stoke is opposite to the one specified in the Intel Manual. Same is true with the mnemonic `addsubps`. (Found in 12 instruction variants.)

Second, the instruction `pslld %xmm1, %xmm2` implements a logical left shift of packed data by a count specified in %xmm1. Stoke’s specification vectorized the operand %xmm1 which is incorrect according to the manual. Similar issues were found in instructions implementing the logical right shift operations on packed data. (Found in 18 instructions.)

Third, `cvtsi2sdl %eax, %xmm1` and `cvtsi2sdl %eax, %xmm0, %xmm1` are respectively SSE- and AVX-versions of the instruction to convert a double-word (32-bit) integer to a scalar single-precision floating-point value. According to the manual, in the AVX-version, the destination bits 127 – 64 of the register %xmm1 are updated to the corresponding bits in the first source operand %xmm0. This is in contrast to the SSE-version of the instruction where the destination bits 127 – 64 should remain unmodified. Stoke specifies the semantics of the AVX-version similar to the SSE-version, which is incorrect. (Found in 4 instruction variants.)

Finally, some instructions, like `imulb %al`, which drive flag registers to an undefined state are not modeled correctly in Stoke. (Found in 8 instruction variants)

All these errors were reported and confirmed [5, 6].

5 Applications

In this section, we illustrate a few applications of our formal semantics, in addition to the reference model mentioned in the previous section. Our goal here is to explain that our semantics can be used for formal reasoning of x86-64 programs for a wide variety of purposes. For this reason, the applications are illustrative only, not meant to serve as a comprehensive evaluation or make any claim of scalability. Moreover, the reported performance of the applications is not optimized, and there is room for improvement, e.g., by providing custom abstractions and lemmas specific to x86-64, similarly to [48]. However, we believe that each application has the potential to be leveraged into a standalone tool, with its own user interface and case studies, but this is not our goal here. In fact, thanks to the language-parametric nature of $\mathcal{K}$, none of these reasoning approaches can be regarded as novel per se, because they are already used in the context of other languages defined in $\mathcal{K}$ and their implementation is language-semantics agnostic. We begin with a discussion of a use case for hardware verification.

5.1 Validating Processor Hardware

Verification is considered one of the most (if not the most) important challenges in modern processor design, for several reasons: (i) the enormous state spaces of modern systems; (ii) the lack of formal specifications in the state-of-practice, (iii) generating high quality test inputs for simulation, (iv) quantifying/analyzing the extent of coverage of simulation, and (v) generating a complete set of properties for checking. For post-silicon validation, an additional challenge is the difficulty of debugging and diagnosing observed erroneous behaviors. For all these reasons, verification is estimated to use 70% of the resources and time, while design takes only 30% [31].

A fully executable formal ISA-level specification such as the one developed here can improve the state of practice in verification in two significant ways.

First, it can provide a reliable specification of functional behavior of hardware with respect to observable states. This increases confidence in the input tests, for both directed and random test generation. High confidence tests can reduce time and increase focus during debugging, triage and diagnosis efforts. This is especially valuable in post-silicon validation, where observability within the chip is very limited, and functional validation is a key goal. This method can also help post-silicon failure diagnosis by identifying buggy input/output pairs and pinpointing specific erroneous output state bits.

Second, since our method can symbolically execute instructions, it can be used to generate input tests that have high coverage. While such analyses have been done at the detailed RTL level [19, 42, 43], we are unaware of similar tools at the x86-64 ISA level\textsuperscript{14}. The most significant advantage of such symbolic execution is the ability to detect corner case or hard to detect bugs [30, 43]. (This is analogous to finding security vulnerabilities due to corner-case software bugs, illustrated in Section 5.3, but applied to the hardware implementation instead of software.) We expect that ISA level symbolic analysis will uncover much subtle and complex bugs due to the higher level of abstraction and better design perspective than the RTL analyses.

A closely related challenge is checking the accuracy of ISA specifications, including reference manuals. By using such manual specifications to construct a formal specification, we may uncover errors in the manual specifications. This is explicitly demonstrated by the two bugs we discovered in the Intel x86-64 manual, while performing the instruction-level validation tests described in Section 4. These bugs were discovered as a result of running test cases using both the formal semantics generated by reading the manuals and the hardware, and finding a mismatch, then checking the manual specification carefully to determine whether the bug lies in the manuals or in the hardware. However, given an existing semantics, a far more valuable strategy would be to automatically generate human-readable documentation from the formal specification. A basic version of this strategy is likely quite feasible today, and much more sophisticated versions

\textsuperscript{14}It is worth mentioning the work by Martignoni et al. [44] about test-case generation for 32-bit x86 ISA.
The advances in concolic test generation, program synthesis, and x86-64 programs, we use the x86-64 verifier to prove the x86-64 semantics. Like in other deductive verifiers, repetitive addresses tinge others (denoted by "..."). It specifies the stack memory execution, and the value of the two parts: the top-level specification and the loop invariant. It specifies that relevant registers used in the program, omitting the irrelevant ones denoted by "...". Specifically, it specifies the behavior of an arbitrary loop iteration. That is, assuming the values of \( A \) and \( B \), resp., in the beginning of an arbitrary loop iteration, it specifies their final values in the end of the entire loop execution, which are 0 and \( B + A(A + 1)/2 \), respectively.

To demonstrate that our semantics can be verified to x86-64 programs, we use the x86-64 verifier to prove the functional correctness of the sum-to-n program as shown in Figure 3. It takes \( N \) as input and returns the sum from 1 to \( N \). The functional correctness can be essentially described as:

\[
\text{\%rax} = \sum_{n}^{N} n = N(N + 1)/2.\]

We present the actual specification that is fed to the x86-64 verifier. The specification has two parts: the top-level specification and the loop invariant.

Figure 4(a) shows the functional correctness specification of the sum-to-n program. The regstate cell specifies the relevant registers used in the program, omitting the irrelevant ones denoted by "...". Specifically, it specifies that \%rdi holds the value \( N \) without being updated during the program execution, and \%rax will have the expected return value. The memstate cell specifies the relevant part of the memory omitting others (denoted by "..."). It specifies the stack memory addresses -8(\%rbp) and -4(\%rbp) corresponding to \( n \) and \( s \), respectively.

The requires clause specifies the condition of \( N \) that prevents the arithmetic overflow. Figure 4(b) shows the loop invariant specification. It specifies the behavior of an arbitrary loop iteration. That is, assuming the values of \( n \) and \( s \) be \( A \) and \( B \), resp., in the beginning of an arbitrary loop iteration, it specifies their final values in the end of the entire loop execution, which are 0 and \( B + A(A + 1)/2 \), respectively.

\[
\begin{align*}
\text{\%rax} & = \sum_{n}^{N} n = N(N + 1)/2; \\
\text{\%rdi} & = A; \\
\text{\%rsi} & = B;
\end{align*}
\]

Note that when \( A + 1 \neq 0 \) and \( B < 2^{31} \), respectively.

(b) Loop invariant

Figure 4. Specification of sum-to-n program

5.2 Program Verification

The \( \mathcal{K} \) framework provides a language-parametric, reachability logic theorem prover [52, 61]. We instantiated it with our semantics, the engine can be used to symbolically execute and explore all possible paths in the given x86-64 program. In this section, we demonstrate how this capability can be used to find a security vulnerability.

\[
A = 0 \quad \text{and} \quad B < 2^{31} \quad \text{and} \quad B < 2^{31} \quad \text{and} \quad B < 2^{31}
\]

(a) Top-level specification

(b) Loop invariant

5.3 Symbolic Execution

\( \mathcal{K} \) automatically derives a correct-by-construction symbolic execution engine from the given semantics. Being instantiated with our semantics, the engine can be used to symbolically execute and explore all possible paths in the given x86-64 program. In this section, we demonstrate how this capability can be used to find a security vulnerability.
The (simplified) path condition provided by the symbolic

```
safe_addptr(int *of, uint64_t a, uint64_t b) {
    uintptr_t r = a + b;
    if (r < a) { \{ of = 1; return r; \}  // 'error state'
    else { return r; } \}  // 'safe state'
}
```

(a) C source code

```
" movl \(-8(\%ebp), \%edx \) \# a[32:0] moved to \%edx
movl \(-16(\%ebp), \%eax \) \# b[32:0] moved to \%eax
addl \%edx, \%eax \# r = a[32:0] + b[32:0]
movl 0x0, \%edx \# check if (32'0 ○ r) < a
```

(b) x86-64 assembly code in a 32-bit target

Figure 5. A security vulnerability in the HiStar kernel

Consider the code snippet of the HiStar [63] kernel, as shown in Figure 5(a)16, in which the KLEE [25] team found a security vulnerability. The safe_addptr function is supposed to compute the sum of two arguments a and b, setting the flag argument of when the arithmetic overflow occurs during the addition. That is, one of the functional correctness properties is that “*of = 1 if a + b > r”, where + is the mathematical addition (with no overflow). The functional correctness, however, is not satisfied when the source code is compiled to a 32-bit target, since the size of r becomes 32-bit (uintptr_t) while the sizes of a and b are still 64-bit (uint64_t).17 A suggested fix [25] is to change the conditional expression from r < a to r < a \(||\ r < b.

Using the symbolic execution engine derived from our semantics, we could find (in ~80 seconds) that, in the assembly code as shown in Figure 5(b), there exists a path that reaches L2 (i.e., the else branch) even if the addition overflow occurs. The (simplified) path condition provided by the symbolic execution engine is:

\[
a + b \geq 2^{32} \land \left( a + b \mod 2^{32} \right) \geq a
\]

where 0 \leq a < 2^{64} and 0 \leq b < 2^{64}. We asked Z3 to solve the above path condition and it returned a solution (i.e.,

\[
\text{int popcnt(uint64_t x) \{ int res = 0; for (; x > 0; x >>>= 1) \{ res += x & 0xffffffff; \} return res; \}
\]

Figure 6. popcnt program

a concrete input to trigger the security vulnerability): a = 0x00000000ffffffff and b = 0xffffffffffffffffffffff00000000.

5.4 Translation Validation of Optimizations

K also provides a program equivalence checker that can be used for the translation validation of compiler optimizations. We derived an x86-64 program equivalence checker from our semantics and used it to validate different optimizations. Figure 6 shows a program that we considered, popcnt, which counts the number of set bits in the given input.

We compiled the program with different optimizations: the GCC compiler optimizations (-O0, -O1, -O2, and -O3), and the Stoke super-optimization. On top of the baseline (-O0), the -O1 optimization produces a code obtained by performing the mem2reg optimization, the -O2 optimization produces one by factoring out the common statement over different branches,18 and the -O3 optimization produces the same code with -O2. The Stoke super-optimization translates the assembly code into a single instruction: popcnt %rdi, %rax, where %rdi and %rax correspond to the input and the return values, respectively.

We validated these optimizations by checking the equivalence between the optimized programs. The equivalence checker symbolically executes each program and compares their return values (i.e., the symbolic expression of the %rax register value) using Z3. It is able to prove successfully that all optimization variants are equivalent, i.e., to check the correctness of all these optimizations on popcnt.

Note that the symbolic execution of the popcnt program does not require an additional annotation about the loop because the number of loop iterations is bound to a constant (i.e., the bit-size of the input, 64).19 In general, however, the equivalence checker may require us to provide an additional annotation about loops, which can be automatically generated by augmenting the underlying compiler.

6 Limitations

Our limitations mostly include missing features of the x86-64 and execution environment, as follows.

16For the simplicity of the presentation, in Figure 5(b), we highlight only the key computations of the assembly compiled from the source. However, in our experiment, the full unmodified compilation is used for the symbolic execution.

17The function call safe_addptr(of, address, size) is used to validate that an user is allowed to access the memory range specified by the arguments address and size. The access is denied if an overflow occurs. A bug in the overflow detection might be exploited by an attacker to gain an access to a memory region beyond the control of the running process.
Floating Point Operations. Our testing shows that we have FP precision issues with instructions implementing the fused-multiply-add operation. This is because the current K’s floating point library [12] implementation lacks support of the FMA capabilities of the GNU MPFR library [15], which we plan to include in future.

Exceptions. We do not support exceptions, including the FP exceptions. Moreover, we do not distinguish between quiet and signaling NaN, i.e., all NaNs are quiet in our model. When the exception condition is encountered, execution proceeds after setting the exception flag.

Concurrency. Like the closest previous work [33, 37], we do not model concurrent semantics or the relaxed memory model as defined by other previous work [47, 56]. Our design, being parameterized on memory model, is amenable to accommodate others’, which we plan to achieve in future.

Instruction Decoding. The instruction decoding semantics is not modeled in the current work. However, we want to note that we have formalized the Intel XED disassembler algorithm, and we are working on formalizing the instruction decoding.

7 Related Work

There have been many projects that host a formal semantics of x86-64 either as their main contribution or as part of their infrastructure. This section summarizes such previous work and compares it to our formal semantics based on three directions that reflect the primary contributions of our work: the completeness of the definition in terms of supported user-level instructions, the faithfulness of the definition in terms of whether it is executable and hence can be evaluated with real code execution, and the generality of the definition in terms of its applicability to formal reasoning analyses.

Strata [37] uses program synthesis to generate the instruction semantics, as SMT bit-vector formulas. Strata’s semantics definition for x86-64 is incomplete as it does not include semantics of about 40% of the user-level instructions in scope that are not straightforward to automatically learn by the presented algorithm for reasons mainly due to limitations of underlying synthesis engine. Moreover, the specifications are executable only for non-float-point (FP) instructions and have not been demonstrated to be usable in a formal analysis setting out-of-the-box.

A contemporary work by Roessle et al. [53] presents a method to extract the big step semantics of a sequential binary program using the small step instruction semantics extracted mostly from Strata [37] plus some manually drafted support for branching and stack operating instructions. Like Strata, their specification is executable only for the non-floating point instructions. However, the work does not aim for completeness of semantics, which is one of our primary goals.

Goel et al. [33] use the ACL2 theorem prover [38] to model the x86-64 ISA and has support for 33% of all user level instructions [17]. They have specifications of a selection of user-level instructions, some system-level instructions, paging, and segmentation. This list is far from a complete semantic definition of x86-64, however it is still state-of-the-art in terms of formal analysis applied directly to x86-64 code. It is also an executable definition as demonstrated in its use for simulations. In our work we do not leverage this definition, since we found a more complete definition such as Strata to be a better starting point towards completeness.

The CompCert verified compiler [40] includes semantics definitions for all intermediate and target languages used within the compiler, including a definition for x86 assembly. The definition is specified in Coq [14], and has been used in a formal setting for proving the correctness of the CompCert’s compilation step to assembly, as well as outside CompCert, e.g., in proofs relating to the certified concurrent OS kernel CeriKOS [34]. However, this definition focuses on the 32-bit x86 instruction set, which is a subset of the x86-64 instruction set. Moreover, it is part of the trust base for CompCert and it is not clear whether or how it has been tested against an actual processor, whereas Strata and ours have been extensively tested.

TSL [41] is a system that can auto-generate tools for various machine code analyses given a semantics definition of the machine language written in the TSL specification. Such a semantics definition for the integer instructions (i.e., no floating-point instructions) of the 32-bit x86 instruction set is given as part of the project. It is used for the generation of various tools, including a machine code synthesizer [60]. This definition, to the best of our knowledge, have not been used for formal specification proofs, i.e., to prove whether a given x86 program meets its specification.

Our semantics, like all of the aforementioned works, uses a sequential consistency memory model, omitting weaker memory models. Existing efforts to specify weaker memory models for x86-64 such as Owens et al. [47] and Sarkar et al. [56], however, suffer from their limited support for instruction semantics (i.e., they consider only a small subset of 32-bit x86 instruction set). We believe that integrating these two complementary efforts is a promising direction toward rigorously reasoning about real-world programs running on modern multiprocessors (e.g., using the Sail framework as we will describe below).

Sail is another language semantics framework, tailored for describing an instruction-set architecture semantics. Sail has been used to specify the semantics of ARMv8-A, RISC-V, and CHERI-MIPS [21], as well as the semantics of a small subset of x86 [2]. Sail is similar to the K framework we employed, but K is for more general-purpose than Sail. Also, the Sail x86 semantics is very limited than ours. It describes

---

20There are some minor omissions on immediate instructions with 8-bit operands for which Strata learns 256 brute force formulas.
the semantics of a fragment of 32-bit user-mode x86 instructions, while ours covers also the 64-bit counterpart as well as the floating-point instructions. Sail, however, allows us to integrate a semantic definition with their relaxed memory models [18, 49] for concurrency semantics. We believe that (automatically) translating our semantics into Sail\(^\text{21}\) is a promising direction to obtain concurrency semantics and thus enable concurrency reasoning for x86 programs, which we leave as future work.

On the other hand, the key differentiator of our effort compared to the aforementioned existing work is that our semantics achieves (A) completeness of supported user-level instructions, (B) faithfulness, and (C) applicability to formal reasoning analyses, at the same time. In Section 8, we will elaborate our novel approaches that allow us to achieve the unique combination.

There are various binary analysis projects that target x86-64 binaries and lift them to a higher-level representation more suitable for the specific analysis. These include Angr [3] using the VEX IR of Valgrind [46], the QEMU [23] emulator using the TCG IR, the software fault isolation tool RockSalt [45] using its own RTL DSL, the disassembler and binary analyzer Radare2 [20] using the ESIL IR [9], the binary analysis tool BAP [24] using the BIL IR, and the static binary translator Remill [55] using LLVM IR [39]. We refer these semantics as indirect as they give the semantics of the x86-64 binary via the translation to their IR, as opposed to the direct semantics such as ours. Note that the direct semantics has its own value. For example, without the direct semantics of x86-64, we cannot even formulate the correctness of a translator from x86-64 to the IR. Analogously, many programming languages (C, C++, Java, etc.) have been given direct semantics, instead of indirect semantics by translation to other languages, for formal reasoning at the desired language granularity.

Hasabnis et al. [35, 36] also present an indirect semantics of x86-64, but in contrast to other indirect semantics, they use machine learning [36] and symbolic execution [35] to automatically learn the translation of x86-64 instructions to their IR, by extracting knowledge from the hard-coded translation logic of compilers such as GCC. However, as they admitted [35], their semantics omits some important details of x86-64 semantics (e.g., the effect of various instructions on CPU flags), and thus is not sufficient to serve as a solid foundation for rigorous formal analyses of x86-64 binary.

8 Lessons Learned

Here we present the lessons we learned during our semantics development, identifying important aspects to be considered,

\(^{21}\)Indeed, the Sail ARMv8-A semantics is automatically generated from the ARM-internal specification of ARMv8-A [51] written in the ARM’s architecture specification language, ASL [50], by using the ASL-to-Sail translator [21].

and clarifying a good practice for developing a large ISA semantics. We also elaborate the novel aspects of our semantics development approach that allow us to obtain the complete and faithful semantics with an affordable amount of effort.

Regarding automatic semantics synthesis  Section 3.2 reports the empirical negative result that a fully automatic synthesis of the whole x86-64 semantics is difficult to achieve. Specifically, in a vast instruction set like x86-64, it is common that many instructions can be grouped together where the instructions of each group are similar to each other except for a few differences. An automatic synthesis technique, using the stratification approach [37], would effectively synthesize such instruction variants’ semantics, provided that the semantics of its representative instructions are given in advance. \(^{22}\) The problem is, however, that it is non-trivial to properly partition all the instructions into such groups, providing the representative instruction semantics for each group, without a priori knowledge about the semantics of all instructions. The vanilla stratification approach [37] turned out to be not sufficient to solve this dilemma, leaving a substantial part of semantics unspecified. Thus, we decided to manually provide the information about partition and representatives, for which we had to consult the manual to obtain knowledge about the remaining part of semantics. Once we obtained the required knowledge, however, we realized that it would be more straightforward to directly turn the knowledge into the semantics than going through the synthesis process, and thus we ended up manually specifying the remaining part of semantics in this work. This (negative) experience led us to think about a new research problem: how to obtain the prior knowledge about the semantics that is enough to partition instructions and identify their representatives, without manually examining the entire manual. This knowledge does not need to be precise, and we believe that such rough information can be automatically extracted from the manual using text processing. We plan to explore this idea while defining the semantics of unspecified and/or new instructions.

Most of the previous efforts in formalizing x86-64 semantics can be categorized based on whether the underlying approach is fully manual [2, 33, 40, 41] or fully automatic [35–37, 53]. We note that none of these approaches, when used in isolation, sufficiently scale to a complete and faithful semantics, as much as ours that combines these complementary approaches and benefits from each other.

\(^{22}\)For certain complex instructions, the size of their group is very small (i.e., they are quite different to each other), and thus the automatic synthesis would not yield a sufficient gain over the effort of specifying the semantics of their representatives, but we found that the number of such isolated instructions of x86-64 is small.
Modeling and executing implementation-dependent behaviors  Like many other programming languages, the x86-64 ISA standard admits implementation-dependent behaviors, that is, each processor implementation can freely choose specific behaviors (Section 1.1). There are two natural, faithful ways of specifying the implementation-dependent behaviors. One is to parameterize the semantics over the implementation dependent behaviors, and later instantiate it with a profile that describes specific behaviors taken by the processor of interest. Another is to introduce non-determinism in the semantics. We took the second approach in this work, since the implementation-dependent behaviors of x86 are quite limited and mostly localized (e.g., as shown in Figure 2).

However, the non-determinism makes it non-trivial to execute (and validate) the semantics. For example, in the hardware co-simulation, the output of hardware execution may vary depending on the underlying processors, while the non-deterministic semantics execution will randomly choose certain behavior, which may be different from the specific behavior implemented in a processor. To mitigate this issue, for the hardware co-simulation, we symbolically executed the semantics where the non-deterministic behaviors are represented symbolically, so that the symbolic output captures all possible behaviors, and then we checked if the hardware output is matched by (an instance of) the symbolic output.

We note that faithfully modeling the implementation-dependent behaviors is necessary for the correctness of the semantics. As mentioned in Section 4.2, however, Stoke [57] does not faithfully model such behaviors, causing certain errors in their semantics that we revealed [5]. On the other hand, most of other existing direct x86-64 semantics employ an approach similar to the ones described above, faithfully modeling the implementation-dependent behaviors. For example, Goel et al. [32] models such behaviors using a constraint function which is guaranteed to be unique and non-deterministic, while they employ the aforementioned profiling approach for concrete execution. TSL [41] makes both approaches available, from which their users can choose.

Employing multiple semantic engineering frameworks  We found that employing multiple semantic frameworks is helpful. Specifically, we employed the two semantic frameworks, \( \text{K} \) and Stoke, where we enjoyed all of their (executive) benefits that make it easier for us to write and validate the semantics, and utilize the semantics in various applications. For example, we wrote the semantics of certain complicated instructions (e.g., \texttt{pcmpestrm}, \texttt{pcmpestr}, and \texttt{pclmulqdq}) in \( \text{K} \), as \( \text{K} \) provides an easy way to specify behaviors with multiple cases, while Stoke would have required us to write a big nested if-then-else expression which is not amenable. For another example, we utilized Stoke to validate most of our instruction semantics as Stoke provides a infrastructure [23] for the hardware co-simulation, whereas we employed \( \text{K} \) to validate the semantics of floating-point instructions as Stoke does not support executing floating-point operations while \( \text{K} \) does.

In order to use the two frameworks interchangeably, we had to develop a translator that translates the semantics in one framework to another. To convince the correctness of the translation, we employed translation validation, where we verified equivalence between the original and the translated semantics for each instruction using the Z3 SMT solver.

Employing multiple frameworks, with the validated translation, allowed us to utilize their exclusive features to efficiently specify (e.g., by using the well-designed specification language of one framework) and validate (e.g., by using the well-developed testing infrastructure of another framework) the semantics, which highly expedited our semantics development process, and thus significantly contributed to the completeness of our semantics. Moreover, employing multiple frameworks in general allows us to immediately benefit from all of their formal analysis tools, increasing the applicability of the semantics in various formal reasoning tasks. Existing semantics development efforts (e.g., [33, 37]), however, employ a single framework without utilizing the potential of other frameworks, which otherwise might have improved completeness and/or faithfulness of their semantics with the same amount of effort.

9 Conclusion  We have presented the most complete formal semantics of x86-64 user-level instructions to date, which has been thoroughly tested, with illustration of potential applications. Moreover, the \( \text{K} \) framework enables us to represent a semantics as SMT theories, which other projects can leverage for their own purposes.

References  


