dump of labs

author: Matthew Sotoudeh <matthew@masot.net> 2023-06-03 14:51:16 -0700
committer: Matthew Sotoudeh <matthew@masot.net> 2023-06-03 14:51:16 -0700
commit: d8c82a8431b10de05ce5921d98fa7ec60ecf5df0 (patch)
tree: dcebb9ba5964e11060e0659657844db1683bed5d /smt-symex
12 files changed, 962 insertions, 0 deletions
diff --git a/smt-symex/HINTS.md b/smt-symex/HINTS.md
new file mode 100644
index 0000000..6545eff
--- /dev/null
+++ b/smt-symex/HINTS.md
@@ -0,0 +1,149 @@
+# Hints
+This lab ended up being a bit longer than I expected. Here I've compiled some
+hints you may find helpful. There are multiple ways to encode these things, and
+the encodings here are not the best: feel free to experiment!
+
+### Encoding bv_eq
+Suppose we have the following code:
+```
+// Create bitvectors x, y with width 2
+int x = new_bv(2),
+    y = new_bv(2);
+// Assert that they're disequal
+clause(-bv_eq(x, y));
+```
+The calls to `new_bv` allocate fresh "SAT-land" variables for each bit of `x`
+and `y`. Suppose `x` has SAT-land bits `1, 2` and `y` has `3, 4`. The goal of
+`bv_eq` is to create a SAT-land literal that is true if and only if `x` and `y`
+have the same bits. We can encode this like:
+```
+5 <=> (1 <=> 3)
+6 <=> (2 <=> 4)
+7 <=> 5 and 6
+```
+
+Now how to encode `5 <=> (1 <=> 3)`? To do this we need to encode:
+```
+5 => (1 <=> 3)
+(1 <=> 3) => 5
+```
+It will turn out that using a different encoding of `(1 <=> 3)` for the first
+vs. second will be helpful to simplify this.
+
+Notice that `1 <=> 3` can be encoded as either:
+- `(1 => 3) and (3 => 1)`, i.e., `{-1, 3} and {-3, 1}` OR
+- `(1 and 3) or (-1 and -3)`.
+
+So let's encode `5 => (1 <=> 3)` as
+```
+5 => {-1, 3} and {-3, 1}
+is the same as:
+5 => {-1, 3}
+5 => {-3, 1}
+is the same as:
+{-5, -1, 3}
+{-5, -3, 1}
+```
+And let's encode `(1 <=> 3) => 5` as
+```
+(1 and 3) or (-1 and -3) => 5
+is the same as:
+(1 and 3) => 5
+(-1 and -3) => 5
+is the same as:
+{-1, -3, 5}
+{1, 3, 5}
+```
+
+Now we're done with the first thing we wanted to encode. Doing the same trick
+lets you encode `6 <=> (2 <=> 4)`. All that's left is `7 <=> 5 and 6`. This is:
+```
+7 => 5 and 6
+5 and 6 => 7
+is the same as
+7 => 5
+7 => 6
+5 and 6 => 7
+is the same as
+{-7, 5}
+{-7, 6}
+{-5, -6, 7}
+```
+For the very last clause, you'll want to use the `clause_arr` function rather
+than `clause` (since it can be arbitrarily large depending on the variable's
+bitwidth).
+
+And that's `bv_eq`!
+
+### Bitvector Addition
+Think lookup table for a ripple-carry adder. Create some temporary SAT-land
+variables for the carry bits, fix the first carry bit to false, and then add
+clauses that compute output & carry bits from the input & prior carry bits.
+This should use 16 clauses per loop iteration, looking something like:
+```
+1 + 1 + 1 = 11b
+(x[i] and y[i] and c[i]) => z[i]
+(x[i] and y[i] and c[i]) => c[i]
+
+1 + 1 + 0 = 10b
+(x[i] and y[i] and -c[i]) => -z[i]
+(x[i] and y[i] and -c[i]) => c[i]
+
+Etc...
+```
+Once you have the first row written out it's just a bunch of copy/paste.
+
+### Encoding Arrays
+The one big thing to know about our array encoding is that arrays don't show up
+in the SAT encoding until the very, very end (`array_axioms` called by
+`solve`). Until then, the array operations are basically just tracking metadata
+about bitvectors, e.g., "bitvector 10 is equal to the value at index given by
+bitvector 6 in array 2." Once we are asked to solve the constraints, though, we
+have to somehow communicate this array information to the SAT solver. This is
+done with the "read-over-write" axioms. To explain these, suppose you have code
+like this (ignoring bitvector widths):
+
+```
+int k1 = new_bv(), k2 = new_bv(), k3 = new_bv(), k4 = new_bv(),
+    v1 = new_bv(), v2 = new_bv();
+int a1 = new_array();
+int a2 = array_set(a1, k1, v1);
+int x1 = array_get(a2, k2);
+int a3 = array_set(a2, k3, v2);
+int x2 = array_get(a3, k4);
+```
+(Note: `array_set` is not "in-place;" think of it as copying the source array,
+changing one index in the copy, and then returning the copy.)
+
+If you encode these bitvectors without adding any array constraints (basically,
+no clauses for this example), what could go wrong? Well, the SAT solver could
+give you back anything. E.g., it could give a solution where `k1 = k2` but
+`x1 != v1`. This is clearly wrong: if you set key `k1` to value `v1`, and then
+check the value at location `k2 = k1`, you should get back something equal to
+the value you stored! We encode that for this read like so:
+```
+(k2 = k1) => (k1 = v1)
+```
+(Note in `array_axioms` you *can* call `bv_eq`!)
+
+We need to do the same for the read of `k4` on the last line. Clearly, if we
+read at `k3` (where we just wrote) we need to return what we stored:
+```
+(k4 = k3) => (x2 = v2)
+```
+On the other hand, if we did not read what we just assigned to but we did read
+what we assigned to earlier, we need to get back what we stored earlier:
+```
+((k4 != k3) and (k4 = k1)) => (x2 = v1)
+```
+
+There's one last think that can go wrong. Suppose `k4 != k3`, `k4 != k1`,
+`k2 != k1`, *but* `k4 = k2`. In other words, we never read anything we
+explicitly set, but we read the same key twice. With just the clauses we added
+so far, the SAT solver could still give us a solution where `x1 != x2` even
+though `k4 = k2`. To rule this out, we need to add a clause:
+```
+((k4 != k3) and (k4 = k2)) => (x2 = x1)
+```
+
+And that's arrays!
diff --git a/smt-symex/PRELAB.md b/smt-symex/PRELAB.md
new file mode 100644
index 0000000..98b7862
--- /dev/null
+++ b/smt-symex/PRELAB.md
@@ -0,0 +1,71 @@
+# SAT/Symex+SMT Labs Prelab
+
+By Wednesday, we will make a symbolic execution (symex) engine for a simple
+little "assembly language." Think of a symex engine as a tool that tries to
+generate inputs to your program that cause it to execute a certain line. For
+example, in this program:
+```
+load 0 0        ; r0 = memory[r0]
+immediate 1 10  ; r1 = 10
+branch_eq 0 1 2 ; if r0 == r1 goto l1 else goto l2
+l1: fail
+l2: exit
+```
+The symex engine will attempt to find initial registers & memory values that
+cause the `fail` instruction to be reached. In this case, it will say setting
+`r0 = 0`, `memory[0] = 10` does the trick.
+
+Symex engines are built on SMT solvers which are built on SAT solvers.
+Roughly, SAT solvers are symex for when you have no branching, no memory, and
+finitely-many boolean registers. The only operations allowed are "and," "or,"
+and "not." SMT solvers add support for int (non-boolean) values and memory.
+Symex engines add support for branching, loops, and some assembly/code syntax.
+
+### Monday: SAT Solver Prelab
+SAT solvers solve _boolean constraints_. For example, you can say "find an
+assignment for $x$, $y$, $z$ such that: (i) $z$ is true, (ii) if $x$ and $y$
+are true then $z$ is false, and (iii) either $x$ or $y$ is true."
+
+In order to keep SAT solvers as simple and fast as possible, the input language
+describing constraints for most SAT solvers is pretty restrictive. Here's an
+example:
+```
+c Amazing comment!!
+p cnf 3 3
+3 0
+-1 -2 -3 0
+1 2 0
+```
+- Initial lines can start with `c` for `c`omments.
+- The `p` (for `p`roblem?) line tells us how many variables and how many
+  clauses (remaining lines).
+- The constraints are given in AND of OR form; each line represents an OR and
+  every line has to be satisfied (AND).
+- Every line ends in `0` (variables are numbered starting from `1`).
+- `n` means that variable should be true, `-n` means false.
+- The basic idea is to assign each variable true or false so that every line
+  "predicts" at least one variable's true/falseness correctly.
+In the above example, line `3 0` means "variable 3 should be true," the next
+line means "either variable 1 is false, or variable 2 is false, or variable 3
+is false," and the last line means "either variable 1 is true or variable 2 is
+true." If you think about it, those constraints are exactly the same as (i),
+(ii), (iii) from earlier.
+
+You can plug this in to minsat (or the [web
+version](http://logicrunch.it.uu.se:4096/~wv/minisat/)) to get a solution:
+```
+SATISFIABLE
+
+-1 2 3 0
+```
+This means there is a solution, namely, variable 1 false, variable 2 true, and
+variable 3 true.
+
+We'll be writing our own version of minisat, including a few hacks to speed it
+up. We'll be basing our version on the SAT solver called Chaff, which you
+should read about
+[here](https://www.princeton.edu/~chaff/publication/DAC2001v56.pdf)
+(specifically sections 1 and 2!).
+
+### SMT/Symex Prelab
+TODO
diff --git a/smt-symex/README.md b/smt-symex/README.md
new file mode 100644
index 0000000..4e74687
--- /dev/null
+++ b/smt-symex/README.md
@@ -0,0 +1,116 @@
+# SMT + SymEx Lab
+
+The goal of this lab is to build a working symbolic execution engine off of our
+SAT solver.
+
+### STOP!!!
+Before you proceed, make sure you do one of:
+- If you finished the "fast" SAT solver last week, copy the `fast` binary into
+  `code/` as `code/fast`.
+- If you only finished the "basic" SAT solver last week, copy the `basic`
+  binary into `code/` as `code/fast` (yes, pretend it's fast!).
+- If you finished neither, but downloaded minisat and it's on your path, pass
+  `-DUSE_MINISAT` to all compilation steps (see the Makefile).
+
+### SMT Solver
+The core of our symex engine will be an SMT solver. An SMT solver is a nice
+frontend to a SAT solver, allowing you to construct constraints involving
+arrays and bitvectors (finite bitwidth integers, like in C).
+
+For simplicity, we'll be building an eager SMT solver, also known as bit
+blasting. This is the same fundamental approach used by the STP solver:
+https://github.com/stp/stp
+
+The idea is to "compile" the higher-level constraints into a big SAT problem,
+then use our SAT solver to solve this problem.  This approach is nice because
+it makes a clean separation between the SMT solver and SAT solver; you can test
+against your SMT solver against any SAT solver to better localize bugs, and we
+don't have to change our SAT solver at all.
+
+When you create a bitvector of size N, it actually creates N new bits in the
+SAT problem representing the N bits of the bitvector. The method `bv_eq(b1,b2)`
+creates a new SAT variable `v_eq` and spits out clauses ensuring that `v_eq` is
+true if and only if all of the bits in b1 and b2 are identical. Similar for
+`bv_add(b1,b2)` --- we essentially spit out a ripple-carry adder in the
+underlying SAT constraints.
+
+Arrays are a bit more complicated. When the user requests `arr[x]`, we give
+back a fresh bitvector and record separately that it's supposed to be `arr[x]`.
+Then, before spitting out the final DIMACS file, we look at every earlier call
+to `arr[y]` and add assertions that `arr[x] = arr[y]` if `x = y` (being careful
+to handle cases where we overwrite `arr[x]`!).
+
+Then, we print the constraints to a DIMACS file, run our SAT solver, and read
+back in the results.
+
+Some tips/reminders for the SAT encoding:
+- `x <=> y` is the same as `x => y` and `y => x`
+- `x => z` is equivalent to the clause `{-x, z}`
+- `x and y => z` is `{-x, -y, z}`
+- `x or y => z` is `x => z` and `y => z`
+- `x => y and z` is `x => y` and `x => z`
+- `x => y or z` is `{-x, y, z}`
+
+I have provided some "unit tests" for the SMT solver which you can run with
+`make do_tests`. I suggest the following order of implementation:
+1. Implement `solve` and `get_solution`
+2. Implement `new_bv`, `const_bv`, `bv_eq`
+3. Run `make do_tests` --- the first test (`test_basic_bv`) should pass
+4. Implement `bv_add`
+5. Run `make do_tests` --- the first two tests (`test_basic_bv`, `test_bv_add`)
+   should pass
+6. Implement `new_array`, `array_store`, `array_get`, `array_axioms`
+7. Run `make do_tests` --- all tests should pass
+
+Then you should be in a good position to move on to the SymEx engine!
+
+### Symbolic Interpreter
+We'll work on a simple little assembly-like IR. This IR has:
+- Arbitrarily many registers
+- A word size determined by the `WORD_SIZE` global in `main.c`
+- A heap memory indexed by registers
+- A branching instruction that jumps to a relative offset if two registers have
+  the same value
+- A "failure" instruction that indicates any path reaching this is a bug
+
+The idea is to basically implement a little interpreter for this language
+except:
+1. Instead of using concrete integers for register/memory values, use
+   bitvectors and array operations as exposed by the SMT solver library.
+2. When a branch statement is reached, simply fork the interpreter. In one
+   process, assert the branch is taken and proceed. In the other, assert it's
+   not and proceed.
+3. When a failure statement is reached, try to solve the current path
+   constraints. If a solution is found, that represents an input that can reach
+   this failure location, i.e., a bug. If we visit all possible paths and none
+   can reach fail, we've proven the absence of such bugs.
+
+The trickiest part is actually printing out the solution at the end. We need to
+track the first time we use a register, as well as the first time we read a
+certain value in memory (since these represent possible inputs to our program).
+
+All you need to do is implement the instructions symbolically by calling out to
+the methods in `smt.h`. Use `get_register` and `set_register` to get/set
+bitvectors representing register values; these methods will automatically
+update IN_VARS for register operations, but you'll need to do it yourself for
+memory operations.
+
+### Symbolic Interpreter Without Arrays
+If you want to try doing the symex engine without finishing the arrays portion
+of `smt.c`, you can do that too. You should be able to run
+`test_programs/test_prog`.
+
+### Extensions
+- Extend the symex IR to support more operations
+- Write a compiler (or interpreter!) from a higher-level language (some useful
+  subset of C?) to the symex IR
+- Modern SMT solvers often use a *lazy* encoding instead of eager/bitblasting.
+  The idea is to only give the SAT solver a subset of the clauses; if it says
+  "unsat" with that subset, you're already done. Otherwise, check if its
+  solution also satisfies the remaining clauses. If not, give it a few more
+  clauses until those clauses rule out the current attempted solution. Repeat
+  this until you either get unsat or find a solution that works for all the
+  clauses. This is often faster in part because it lets you check the remaining
+  clauses however you want, without having to somehow encode them into CNF.
+  Also, you may not need all the clauses to prove unsat (or even SAT), so just
+  adding them as-needed is a good idea.
diff --git a/smt-symex/code/Makefile b/smt-symex/code/Makefile
new file mode 100644
index 0000000..55950d8
--- /dev/null
+++ b/smt-symex/code/Makefile
@@ -0,0 +1,30 @@
+.PHONY: clean all
+
+CFLAGS += -g
+CFLAGS += -O3
+# CFLAGS += -fsanitize=address
+
+all: main test_basic_bv test_bv_add test_arrays
+
+main: main.c smt.o
+	mkdir -p temp_files
+	$(CC) $(CFLAGS) $^ -o $@
+
+test_basic_bv: test_cases/test_basic_bv.c smt.o
+	$(CC) $(CFLAGS) $^ -o $@ -I.
+
+test_bv_add: test_cases/test_bv_add.c smt.o
+	$(CC) $(CFLAGS) $^ -o $@ -I.
+
+test_arrays: test_cases/test_arrays.c smt.o
+	$(CC) $(CFLAGS) $^ -o $@ -I.
+
+do_tests: test_basic_bv test_bv_add test_arrays
+	./test_basic_bv
+	./test_bv_add
+	./test_arrays
+
+smt.o: smt.c
+
+clean:
+	rm -rf *.o main temp_files
diff --git a/smt-symex/code/main.c b/smt-symex/code/main.c
new file mode 100644
index 0000000..1aba62c
--- /dev/null
+++ b/smt-symex/code/main.c
@@ -0,0 +1,190 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <assert.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include "smt.h"
+
+/**** GLOBALS ****/
+
+// This is the bitvector width to use, i.e., the word size of the machine. Note
+// that the underlying SMT library will start to fail if this is greater than
+// 32, and will *definitely* fail if this is greater than 63.
+static const int WORD_SIZE = 4;
+
+// Each instruction in our program has one of these opcodes
+enum op {
+    OP_EXIT = 0,    // End program
+    OP_IMMEDIATE,   // Copy an immediate into a register
+    OP_MOVE,        // Copy one register into another
+    OP_STORE,       // Move one register into the memory pointed to by another
+    OP_LOAD,        // Move memory pointed to by one register into another
+    OP_BRANCH_EQ,   // If the two registers have identical values, jump
+    OP_FAIL,        // If this point is reached, there is a bug
+    OP_ADD,         // Addition of registers
+};
+
+static const char *op_names[] = {
+    "exit", "immediate", "move", "store", "load", "branch_eq", "fail", "add",
+};
+
+// An instruction is an opcode and some arguments
+struct instruction {
+    enum op op;
+    int64_t args[3];
+};
+
+// We store the program as an array of instructions
+static struct instruction *PROGRAM = NULL;
+static int N_PROGRAM = 0;
+
+// An input variable is a register whose value we use without ever loading or
+// moving a value into it, *OR* a memory read that hasn't been read before. We
+// use this list to print out just the parts of the solution that we need (the
+// idea is everything else is just a function of these). Every initial read
+// from an unassigned register is stored here, along with every memory
+// load/store (they will be dedup'd at the printing stage).
+struct in_var {
+    int register_number;
+    int bv_id;
+
+    // If this is a memory lookup...
+    int bv_key, is_store;
+};
+static struct in_var *IN_VARS = NULL;
+static int N_IN_VARS = 0;
+
+static int *REGS = NULL, N_REGS = 0;
+static void set_register(int n, int val) {
+    while (n >= N_REGS)
+        APPEND_GLOBAL(REGS) = -1;
+    REGS[n] = val;
+}
+
+static int get_register(int n) {
+    if (n < N_REGS && REGS[n] >= 0) return REGS[n];
+    set_register(n, new_bv(WORD_SIZE));
+    APPEND_GLOBAL(IN_VARS) = (struct in_var){
+        .register_number = n,
+        .bv_id = REGS[n],
+    };
+    return REGS[n];
+}
+
+/**** main symbolic interpreter ****/
+int main() {
+    // Parse the input program
+    char opcode[20];
+    int64_t args[3];
+    while (!feof(stdin)) {
+        memset(args, 0, sizeof(args));
+        assert(scanf("%s %ld %ld %ld ", opcode, &(args[0]), &(args[1]), &(args[2])));
+        if (!strcmp(opcode, ";")) {
+            while (getc(stdin) != '\n');
+            continue;
+        }
+        enum op op = OP_EXIT;
+        for (int i = 0; i < sizeof(op_names) / sizeof(op_names[0]); i++) {
+            if (!strcmp(opcode, op_names[i])) {
+                op = i;
+                break;
+            }
+        }
+        APPEND_GLOBAL(PROGRAM).op = op;
+        memcpy(PROGRAM[N_PROGRAM - 1].args, args, sizeof(args));
+    }
+
+    // Create an array to represent program memory (uncomment once your solver
+    // supports arrays!)
+    // int array_memory = new_array();
+
+    // Do the symbolic interpretation. We generally use dst src ordering.
+    int pc = 0;
+    while (pc < N_PROGRAM) {
+        struct instruction instruction = PROGRAM[pc++];
+        printf("[%d] %s\n", getpid(), op_names[instruction.op]);
+        int bv_lhs, bv_rhs, bv_dst_addr, bv_src;
+        switch (instruction.op) {
+            // Register Operations
+            case OP_IMMEDIATE:  // immediate dstreg value
+                assert(!"Implement me");
+                break;
+            case OP_MOVE:       // move dstreg srcreg
+                assert(!"Implement me");
+                break;
+            case OP_ADD:        // add dstreg op1reg op2reg
+                assert(!"Implement me");
+                break;
+
+            // Memory operations
+            case OP_STORE:      // store addrreg valuereg
+                // Get a bv for the destination address register and source
+                // register value, then do a set to memory.
+                // IMPORTANT: Afterwards, append to IN_VARS like so:
+                // APPEND_GLOBAL(IN_VARS) = (struct in_var){
+                //     .register_number = -1,
+                //     .bv_key = bv_dst_addr,
+                //     .is_store = 1,
+                // };
+                assert(!"Implement me");
+                break;
+            case OP_LOAD:       // load addrreg dstvaluereg
+                // Get a bv for the source address register, do a get from
+                // memory, then set the destination register to the result.
+                // IMPORTANT: Afterwards, append to IN_VARS like so:
+                // APPEND_GLOBAL(IN_VARS) = (struct in_var){
+                //     .register_number = -1,
+                //     .bv_id = bv_dst_register_value,
+                //     .bv_key = bv_source_address_register,
+                //     .is_store = 0,
+                // };
+                assert(!"Implement me");
+                break;
+
+            // Branching
+            case OP_BRANCH_EQ:  // branch_eq op1 op2 offset_if_eq
+                // Get bvs for the two register arguments. Fork, asserting
+                // equality/disequality in each process and update the pc as
+                // necessary. Note that, if the values are equal, we should
+                // jump to the pc of this instruction + offset_if_eq. But above
+                // we've already done pc++, so you need one fewer.
+                assert(!"Implement me");
+                break;
+
+            // Success & failure
+            case OP_FAIL:
+                if (solve()) {
+                    printf("[%d] Found solution:\n", getpid());
+                    for (int i = 0; i < N_IN_VARS; i++) {
+                        if (IN_VARS[i].register_number >= 0) {
+                            printf("\t[%d] Register %d = %ld\n", getpid(),
+                                   IN_VARS[i].register_number,
+                                   get_solution(IN_VARS[i].bv_id, 0));
+                        } else if (IN_VARS[i].is_store) {
+                            continue;
+                        } else {
+                            int key = get_solution(IN_VARS[i].bv_key, 0);
+                            // Make sure we haven't read from/wrote to this key
+                            // before
+                            int found = 0;
+                            for (int j = 0; j < i; j++) {
+                                if (IN_VARS[j].register_number == -1
+                                        && get_solution(IN_VARS[j].bv_key, 0) == key) {
+                                    found = 1;
+                                    break;
+                                }
+                            }
+                            if (found) continue;
+                            printf("\t[%d] Memory[%d] = %ld\n", getpid(),
+                                   key, get_solution(IN_VARS[i].bv_id, 0));
+                        }
+                    }
+                }
+            case OP_EXIT:
+                return 0;
+        }
+    }
+    return 0;
+}
diff --git a/smt-symex/code/smt.c b/smt-symex/code/smt.c
new file mode 100644
index 0000000..dec99b2
--- /dev/null
+++ b/smt-symex/code/smt.c
@@ -0,0 +1,220 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <string.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include "smt.h"
+
+// Keep track of clauses to be sent to the SAT solver
+struct clause {
+    sat_lit *literals;
+    int n_literals;
+};
+static struct clause *CLAUSES = NULL;
+static int N_CLAUSES = 0;
+
+void clause_arr(int *literals) {
+    int i = N_CLAUSES;
+    APPEND_GLOBAL(CLAUSES) = (struct clause){
+        .literals = NULL,
+        .n_literals = 0,
+    };
+    for (; *literals; literals++) {
+        APPEND_FIELD(CLAUSES[i], literals) = *literals;
+    }
+}
+
+// Keep track of arrays in a tree structure. (store a k v) creates a new array
+// whose parent is a, bv_store_key is k, and bv_store_value is v. Whenever a
+// get is done on the array, a corresponding bv_lookup is appended to its list.
+struct bv_lookup {
+    bv_h bv_key, bv_value;
+};
+
+struct array {
+    array_h array_parent;
+
+    bv_h bv_store_key, bv_store_value;
+
+    struct bv_lookup *bv_lookups;
+    int n_bv_lookups;
+};
+static struct array *ARRAYS = NULL;
+static int N_ARRAYS = 0;
+
+// Keep track of bitvectors
+struct bv {
+    sat_lit *bits;
+    int n_bits;
+};
+static struct bv *BVS = NULL;
+static int N_BVS = 0;
+
+// You can get a fresh SAT variable like NEXT_SAT_VAR++
+static sat_lit NEXT_SAT_VAR = 1;
+
+array_h new_array() {
+    // Create a new array. It has no parent (-1) and no lookups yet (NULL, 0).
+    // Returns its index in the ARRAYS vector.
+    assert(!"Implement me!");
+}
+
+array_h array_store(int old_array, int bv_key, int bv_value) {
+    // Construct a new array that is the same as old_array except bv_key is
+    // updated to bv_value. This should look like new_array() except you set
+    // array_parent, bv_store_key, bv_store_value correctly.
+    assert(!"Implement me!");
+}
+
+int array_get(int array, int bv_key, int out_width) {
+    // Create a new bitvector and (on the corresponding array) record this
+    // key/value pair lookup.
+    assert(!"Implement me!");
+}
+
+int new_bv(int width) {
+    // Create a fresh bitvector of the given width. Note here you need to
+    // append to the BVS vector as well as initialize all its bits to fresh SAT
+    // variables.
+    assert(!"Implement me!");
+}
+
+int const_bv(int64_t value, int width) {
+    // Like new_bv, except also add clauses asserting its bits are the same as
+    // those of @value. Please do little-endian; bits[0] should be value & 1.
+    assert(!"Implement me!");
+}
+
+int bv_eq(int bv_1, int bv_2) {
+    int width = BVS[bv_1].n_bits;
+    assert(width == BVS[bv_2].n_bits);
+
+    // This one is a doozy: add a fresh SAT variable and enough SAT clauses to
+    // assert that this variable is true iff all the bits of bv_1 and bv_2 are
+    // equal. I suggest using as many of intermediate/temp SAT variables as you
+    // need; generally, BCP does a great job at handling those so they're more
+    // or less "free" (at least, for our purposes rn!).
+    assert(!"Implement me!");
+}
+
+int bv_add(int bv_1, int bv_2) {
+    int width = BVS[bv_1].n_bits;
+    assert(width == BVS[bv_2].n_bits);
+
+    int out = new_bv(width);
+
+    // This one is similarly big; basically you want to do a ripple-carry
+    // adder to assign the bits of @out based on the bits of @bv_1, @bv_2. I
+    // would recommend just trying to encode it as a truth table.
+    assert(!"Implement me!");
+}
+
+static void array_axioms(struct array array, int compare_up_to,
+                         struct bv_lookup bv_lookup, int already_handled) {
+    if (array.array_parent >= 0) {
+        // This array is of the form (store parent k v), where k and v are
+        // array.bv_store_*. Add clauses here of the form:
+        // ((k == lookup.key) ^ !already_handled) => (store_val == bv_val)
+        // Hint: use bv_eq!
+        assert(!"Implement me!");
+
+        // Now we need to record if that worked or not. If this is picking up
+        // that store, then we don't need the value to match with any prior
+        // values (because they were overridden). So make a fresh SAT variable
+        // new_already_handled and assign it:
+        // (already_handled or (k == lookup.key)) <=> new_already_handled
+        assert(!"Implement me!");
+
+        // Then update already_handled = new_already_handled
+        assert(!"Implement me!");
+    }
+
+    // If we haven't found a set yet, compare lookup[key] to all prior lookups
+    // on the array. If keys eq, vals should be eq too.
+    for (int i = 0; i < compare_up_to; i++) {
+        struct bv_lookup sub_lookup = array.bv_lookups[i];
+        // We want:
+        // !already_handled ^ (sub_lookup.key == bv_lookup.key)
+        // => sub_lookup.value == bv_lookup.value
+        assert(!"Implement me!");
+    }
+
+    // If we haven't found a set yet, go back through parents and repeat this
+    // reasoning.
+    if (array.array_parent >= 0) {
+        array_axioms(ARRAYS[array.array_parent],
+                     ARRAYS[array.array_parent].n_bv_lookups,
+                     bv_lookup, already_handled);
+    }
+}
+
+int *SAT_SOLUTION = NULL;
+int solve() {
+    assert(!SAT_SOLUTION);
+    char constraint_path[256] = "";
+    sprintf(constraint_path, "temp_files/constraints.%d.dimacs", getpid());
+
+    FILE *fout = fopen(constraint_path, "w");
+
+    int zero = NEXT_SAT_VAR++;
+    clause(-zero);
+
+    // Go through array stores & lookups and set up the N^2 bv_eq implications
+    // implied
+    for (int i = 0; i < N_ARRAYS; i++) {
+        for (int j = 0; j < ARRAYS[i].n_bv_lookups; j++) {
+            array_axioms(ARRAYS[i], j, ARRAYS[i].bv_lookups[j], zero);
+        }
+    }
+
+    // Print the clauses and some info about the variable interpretations
+    for (int i = 0; i < N_BVS; i++) {
+        fprintf(fout, "c bitvector %d:", i);
+        for (int j = 0; j < BVS[i].n_bits; j++)
+            fprintf(fout, " %d", BVS[i].bits[j]);
+        fprintf(fout, "\n");
+    }
+    fprintf(fout, "p cnf %d %d\n", NEXT_SAT_VAR - 1, N_CLAUSES);
+
+    // Iterate through the clauses and print them in DIMACS format.
+    assert(!"Implement me!");
+
+    fclose(fout);
+
+    char result_path[256] = "";
+    sprintf(result_path, "temp_files/results.%d.txt", getpid());
+
+    char cmd[1024] = "";
+#ifdef USE_MINISAT
+    sprintf(cmd, "minisat %s %s > /dev/null", constraint_path, result_path);
+#else
+    sprintf(cmd, "cat %s | ./fast > %s", constraint_path, result_path);
+#endif
+    assert(system(cmd) != -1);
+
+    FILE *results = fopen(result_path, "r");
+    char sat_or_unsat[10] = "NONE";
+    assert(fscanf(results, "%s ", sat_or_unsat));
+    if (!strcmp(sat_or_unsat, "UNSAT")) {
+        fclose(results);
+        return 0;
+    } else if (!strcmp(sat_or_unsat, "SAT")) {
+        SAT_SOLUTION = malloc(NEXT_SAT_VAR * sizeof(SAT_SOLUTION[0]));
+        int literal = 0;
+        while (!feof(results) && fscanf(results, " %d ", &literal)) {
+            if (literal < 0)    SAT_SOLUTION[-literal] = 0;
+            else                SAT_SOLUTION[literal] = 1;
+        }
+        fclose(results);
+        return 1;
+    }
+    exit(1);
+}
+
+int64_t get_solution(int bv, int as_signed) {
+    assert(SAT_SOLUTION);
+
+    // Read the bits for @bv from SAT_SOLUTION into an int64_t.
+    assert(!"Implement me!");
+}
diff --git a/smt-symex/code/smt.h b/smt-symex/code/smt.h
new file mode 100644
index 0000000..678a02e
--- /dev/null
+++ b/smt-symex/code/smt.h
@@ -0,0 +1,48 @@
+#pragma once
+#include <stdint.h>
+
+typedef int bv_h; // handle to a bitvector
+typedef int sat_lit; // sat literal
+typedef int array_h; // handle to an array
+
+#define APPEND_GLOBAL(NAME) (*({ \
+    NAME = realloc(NAME, (++N_##NAME) * sizeof(NAME[0])); \
+    NAME + N_##NAME - 1; \
+}))
+
+#define APPEND_FIELD(OBJ, FIELD) (*({ \
+    (OBJ).FIELD = realloc((OBJ).FIELD, (++(OBJ).n_##FIELD) * sizeof((OBJ).FIELD[0])); \
+    (OBJ).FIELD + ((OBJ).n_##FIELD - 1); \
+}))
+
+// Create a fresh array and return a handle to it
+array_h new_array();
+// Return an updated array where the only difference is bv_key's value is now
+// bv_value
+array_h array_store(array_h old_array, bv_h bv_key, bv_h bv_value);
+// Return the value of bv_key in this array
+array_h array_get(array_h array, bv_h bv_key, int out_width);
+// Create a fresh bitvector of the given width and return a handle to it
+bv_h new_bv(int width);
+// Create a *constant* bitvector of the given width with the given value and
+// return a handle to it
+bv_h const_bv(int64_t value, int width);
+// Create a new literal/variable that is true iff the bitvectors represented by
+// handles bv_1 and bv_2 are equal.
+sat_lit bv_eq(bv_h bv_1, bv_h bv_2);
+// Return a handle to a fresh bitvector with value equal to the sum of that of
+// bv_1, bv_2.
+bv_h bv_add(bv_h bv_1, bv_h bv_2);
+
+// Add a clause/assertion to the underlying SAT instance. E.g., to assert that
+// bv1 and bv2 are handles to distinct bitvectors, use clause(-bv_eq(bv1, bv2))
+void clause_arr(sat_lit *literals);
+#define clause(...) clause_arr((sat_lit[]){__VA_ARGS__, 0})
+
+// Try to find a solution to the current set of constraints. Should only be
+// called once (per process). Returns 0 for unsat, 1 for sat. In the latter
+// case, the satisfying model can be queried with get_solution.
+int solve();
+// Query the model that satisfies the current constraints. Can only be called
+// once solve() is called and returns 1.
+int64_t get_solution(bv_h bv, int as_signed);
diff --git a/smt-symex/code/test_cases/test_arrays.c b/smt-symex/code/test_cases/test_arrays.c
new file mode 100644
index 0000000..6a0cef4
--- /dev/null
+++ b/smt-symex/code/test_cases/test_arrays.c
@@ -0,0 +1,67 @@
+#include <stdio.h>
+#include <assert.h>
+#include "smt.h"
+
+// Tests new_array, array_store, array_get
+int main() {
+    printf("Testing arrays...\n");
+
+    int bv_const_10 = const_bv(10, 10),
+        bv_const_0 = const_bv(0, 10),
+        bv_x = new_bv(10),
+        bv_y = new_bv(10);
+
+    int arr0 = new_array();
+    // Store x into 10
+    int arr1 = array_store(arr0, bv_const_10, bv_x);
+    // Store y into x
+    int arr2 = array_store(arr1, bv_x, bv_y);
+    // Store x into arr2[10]
+    int arr3 = array_store(arr2, array_get(arr2, bv_const_10, 10), bv_x);
+
+    // Assert x, y are both non-zero and both distinct
+    clause(-bv_eq(bv_x, bv_const_0));
+    clause(-bv_eq(bv_y, bv_const_0));
+    clause(-bv_eq(bv_x, bv_y));
+
+    // We're going to check the following facts:
+    // (1) arr1[10] == x
+    // (2) arr2[x] == y
+    // (3) if x == 10 then arr2[10] == y
+    //     else (arr2[10] == x && arr2[x] == y)
+    // (4) if x == 10 then (arr3[y] == x && arr3[x] == arr2[x]),
+    //     else arr3[x] == x
+    int arr1_10 = array_get(arr1, bv_const_10, 10),
+        arr2_x  = array_get(arr2, bv_x, 10),
+        arr2_10 = array_get(arr2, bv_const_10, 10),
+        arr3_y  = array_get(arr3, bv_y, 10),
+        arr3_x  = array_get(arr3, bv_x, 10);
+
+    // Should be sat...
+    assert(solve());
+
+    assert(get_solution(arr1_10, 0) == get_solution(bv_x, 0));
+    assert(get_solution(arr2_x, 0) == get_solution(bv_y, 0));
+    if (get_solution(bv_x, 0) == 10) {
+        assert(get_solution(arr2_10, 0) == get_solution(bv_y, 0));
+    } else {
+        assert(get_solution(arr2_10, 0) == get_solution(bv_x, 0));
+        assert(get_solution(arr2_x, 0) == get_solution(bv_y, 0));
+    }
+    if (get_solution(bv_x, 0) == 10) {
+        assert(get_solution(arr3_y, 0) == get_solution(bv_x, 0));
+        assert(get_solution(arr3_x, 0) == get_solution(arr2_x, 0));
+    } else {
+        assert(get_solution(arr3_x, 0) == get_solution(bv_x, 0));
+    }
+
+    // And the inequality & const clauses we added are important too...
+    assert(get_solution(bv_const_10, 0) == 10);
+    assert(get_solution(bv_const_0, 0)  == 0);
+    assert(get_solution(bv_x, 0) != get_solution(bv_y, 0));
+    assert(get_solution(bv_x, 0));
+    assert(get_solution(bv_y, 0));
+
+    printf("Passed!\n");
+    return 0;
+}
diff --git a/smt-symex/code/test_cases/test_basic_bv.c b/smt-symex/code/test_cases/test_basic_bv.c
new file mode 100644
index 0000000..6f6b922
--- /dev/null
+++ b/smt-symex/code/test_cases/test_basic_bv.c
@@ -0,0 +1,28 @@
+#include <stdio.h>
+#include <assert.h>
+#include "smt.h"
+
+// Tests new_bv, const_bv, bv_eq, solve, and get_solution.
+int main() {
+    printf("Testing basic bitvectors...\n");
+
+    int bv_const_10 = const_bv(10, 10);
+    int bv_x = new_bv(10),
+        bv_y = new_bv(10);
+
+    // assert x == const_10
+    clause(bv_eq(bv_x, bv_const_10));
+    // assert x == y
+    clause(bv_eq(bv_x, bv_y));
+
+    // Should be sat...
+    assert(solve());
+
+    // And everything should be 10!
+    assert(get_solution(bv_const_10, 0) == 10);
+    assert(get_solution(bv_x, 0)        == 10);
+    assert(get_solution(bv_y, 0)        == 10);
+
+    printf("Passed!\n");
+    return 0;
+}
diff --git a/smt-symex/code/test_cases/test_bv_add.c b/smt-symex/code/test_cases/test_bv_add.c
new file mode 100644
index 0000000..d73951a
--- /dev/null
+++ b/smt-symex/code/test_cases/test_bv_add.c
@@ -0,0 +1,29 @@
+#include <stdio.h>
+#include <assert.h>
+#include "smt.h"
+
+// Test all bitvector functions, including add. Uses the SMT solver to compute
+// subtraction, i.e., solve for x such that 11 + x = 23.
+int main() {
+    printf("Testing bitvectors with addition...\n");
+
+    int bv_const_11 = const_bv(11, 10),
+        bv_const_23 = const_bv(23, 10);
+    int bv_x = new_bv(10);
+
+    int bv_11_plus_x = bv_add(bv_const_11, bv_x);
+
+    // assert 11 + x == 23
+    clause(bv_eq(bv_11_plus_x, bv_const_23));
+
+    // Should be sat...
+    assert(solve());
+
+    // And 11 = 11, 23 = 23, and x = 12
+    assert(get_solution(bv_const_11, 0) == 11);
+    assert(get_solution(bv_const_23, 0) == 23);
+    assert(get_solution(bv_x, 0)        == 12);
+
+    printf("Passed!\n");
+    return 0;
+}
diff --git a/smt-symex/test_programs/test_memory b/smt-symex/test_programs/test_memory
new file mode 100644
index 0000000..0240094
--- /dev/null
+++ b/smt-symex/test_programs/test_memory
@@ -0,0 +1,6 @@
+load 0 0
+; assert(r0 == 10)
+immediate 1 10
+branch_eq 0 1 2
+fail
+exit
diff --git a/smt-symex/test_programs/test_prog b/smt-symex/test_programs/test_prog
new file mode 100644
index 0000000..73486c4
--- /dev/null
+++ b/smt-symex/test_programs/test_prog
@@ -0,0 +1,8 @@
+immediate 0 5
+immediate 1 2
+; r3 = (r1=2) + r2
+add 3 1 2
+; assert(r3 == 5)
+branch_eq 3 0 2
+exit
+fail
author	Matthew Sotoudeh <matthew@masot.net>	2023-06-03 14:51:16 -0700
committer	Matthew Sotoudeh <matthew@masot.net>	2023-06-03 14:51:16 -0700
commit	d8c82a8431b10de05ce5921d98fa7ec60ecf5df0 (patch)
tree	dcebb9ba5964e11060e0659657844db1683bed5d /smt-symex