Some hints for debugging the compiler, etc. (credit to Eli for writing some of
these down!)

---------

If you find that your output includes a lot of `error <path>` statements:
1. Choose one of the paths that was produced
2. Run that path through `python3 helper_scripts/check_file.py <path> --debug`
3. Review the output in `/tmp/hi.prog`

From here, either run this output through the `main` routine from the prelab,
or look to see if there's a branch to some label that doesn't exist in your
IR.

If the latter is the case, look up that line number in the Linux (or whatever
repo you're analyzing) source and see which statement there (like an `if`
statement or `while` loop) your compiler is translating into the bad branch,
then debug your compiler from there.

---------

if you find a .c file that causes an error or a false positive/negative,
delete half the file and use check_file.py to check if the undesirable
behavior is still there. if it is, keep the smaller file; if not, put that
half back and try deleting a different chunk. repeat until you have a minimal
input that triggers the behavior you want to fix. also simplify the remaining
lines aggressively. (this is called delta debugging.) extra points if you use
creduce :)
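
the halving loop above can be sketched in C. this is a simplified version of
delta debugging, not creduce's algorithm, and everything in it is made up for
illustration: `reduce_lines`, the `still_fails_fn` callback, and the toy
predicate `toy_still_fails`. a real predicate would write the lines to a file
and run check_file.py on it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINES 1024

/* Hypothetical predicate: returns 1 if these n lines still trigger the bug. */
typedef int (*still_fails_fn)(const char **lines, size_t n);

/* Toy predicate for demonstration only: the "bug" needs both the declaration
   of p and the dereferencing assignment to be present. */
static int toy_still_fails(const char **lines, size_t n) {
    int has_decl = 0, has_deref = 0;
    for (size_t i = 0; i < n; i++) {
        if (strstr(lines[i], "int *p")) has_decl = 1;
        if (strstr(lines[i], "*p = x")) has_deref = 1;
    }
    return has_decl && has_deref;
}

/* Shrinks lines[0..n) in place and returns the new count. Tries to delete
   chunks of decreasing size, keeping a deletion whenever the bug survives. */
static size_t reduce_lines(const char **lines, size_t n, still_fails_fn still_fails) {
    const char *trial[MAX_LINES];
    assert(n <= MAX_LINES);
    for (size_t chunk = n / 2; chunk >= 1; chunk /= 2) {
        size_t start = 0;
        while (start < n) {
            size_t end = start + chunk < n ? start + chunk : n;
            /* build the candidate with lines[start..end) removed */
            size_t m = 0;
            for (size_t i = 0; i < n; i++)
                if (i < start || i >= end) trial[m++] = lines[i];
            if (m > 0 && still_fails(trial, m)) {
                memcpy(lines, trial, m * sizeof *lines);
                n = m;       /* keep the deletion, retry at the same offset */
            } else {
                start = end; /* this chunk is needed, move on */
            }
        }
    }
    return n;
}
```

the result is only minimal with respect to deleting whole lines; shrinking
within a line still has to be done by hand (or by creduce).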

if you're not getting thousands of errors on the kernel, but it's also not
finding things (few true positives), you can take a look at the output from my
checker and try fiddling with your code until it can find those bugs on those
files (use python3 helper_scripts/check_file.py).

when in doubt, use ASan (AddressSanitizer)! (but turn it off before running
check_repo or check_file)

in the prelab *_exprmap methods, make sure you're allocating space for the new
exprmap's exprs and deref_labels fields
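
a minimal sketch of what "allocating space" means here. the real exprmap
struct isn't shown in these notes, so `toy_exprmap`, its `n` field, and
`copy_toy_exprmap` are assumptions for illustration; only the field names
exprs and deref_labels come from the prelab.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: NOT the prelab's actual struct. */
struct toy_exprmap {
    size_t n;                  /* number of entries */
    const char **exprs;        /* n expression strings */
    const char **deref_labels; /* n labels, parallel to exprs */
};

/* Returns a copy of *src that owns freshly allocated arrays.
   The strings themselves are shared with src (shallow copy). */
static struct toy_exprmap copy_toy_exprmap(const struct toy_exprmap *src) {
    struct toy_exprmap dst;
    dst.n = src->n;
    dst.exprs = calloc(src->n, sizeof *dst.exprs);
    dst.deref_labels = calloc(src->n, sizeof *dst.deref_labels);
    /* calloc may return NULL for n == 0, so only treat NULL as failure
       when there is something to copy */
    assert(src->n == 0 || (dst.exprs && dst.deref_labels));
    if (src->n > 0) {
        memcpy(dst.exprs, src->exprs, src->n * sizeof *dst.exprs);
        memcpy(dst.deref_labels, src->deref_labels,
               src->n * sizeof *dst.deref_labels);
    }
    return dst;
}
```

the bug to avoid is copying the struct by value and reusing the old arrays:
then writes through the "new" map silently modify the old one too.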

in the prelab visit(), double-check that the assignment derefed =
instr->always_derefed is happening in a reasonable place; I moved it around in
the starter code after committing and this caused some people's prelab to get
messed up after a pull & merge (sorry :( )

in the prelab visit(), you can return once "instr->visited &&
subset_exprmap(instr->always_derefed, derefed)". One way to think about why
this is an OK time to return: you're going to update instr->always_derefed to
the intersection, but if it's a subset then the intersection is just the old
value of instr->always_derefed. so you won't learn anything new by continuing
this path. (Note my comment about this in the code is arguably
wrong/ambiguous, sorry! tho most people seemed to get it right regardless)
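
to see the subset argument concretely, here is a toy model where an exprmap is
replaced by a bitset (bit i set means "expression i has been dereferenced").
`toy_set`, `toy_subset`, `toy_intersect`, and `visit_would_learn` are all
hypothetical names, not prelab functions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for an exprmap: one bit per expression. */
typedef uint32_t toy_set;

/* Returns 1 if every element of a is also in b. */
static int toy_subset(toy_set a, toy_set b) { return (a & ~b) == 0; }

/* Elements present in both a and b. */
static toy_set toy_intersect(toy_set a, toy_set b) { return a & b; }

/* Returns 1 if visiting with `derefed` could change `always_derefed`,
   0 if the early-return condition from the note above applies. */
static int visit_would_learn(int visited, toy_set always_derefed, toy_set derefed) {
    if (visited && toy_subset(always_derefed, derefed))
        return 0; /* the intersection would equal always_derefed: nothing new */
    return 1;
}
```

whenever `toy_subset(a, d)` holds, `toy_intersect(a, d) == a`, which is
exactly why the update would be a no-op.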

if you're getting a bunch of errors when running on the kernel, make sure the
target labels for every branch instruction output by your compiler actually
exist as the label of some other instruction in the IR. This can occur, e.g.,
if your implementation of "if" branches to an "else_" label that only actually
gets output to the IR if there is an else block (think about what happens if
there's no else block). Missing targets will cause the prelab checker to
silently exit(1) ... sorry should have printed an error message there ...
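
you can run this check yourself before handing the IR to the prelab checker.
a minimal sketch, assuming a made-up in-memory view of the IR: the `toy_instr`
struct and both function names are hypothetical, not the prelab's types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical IR instruction: NOT the prelab's real struct. */
struct toy_instr {
    const char *label;      /* label of this instruction, or NULL */
    const char *targets[2]; /* branch targets, NULL if unused */
};

/* Returns 1 if some instruction in prog[0..n) carries `label`. */
static int label_exists(const struct toy_instr *prog, size_t n, const char *label) {
    for (size_t i = 0; i < n; i++)
        if (prog[i].label && strcmp(prog[i].label, label) == 0)
            return 1;
    return 0;
}

/* Returns the first branch target with no matching label,
   or NULL if every target resolves. */
static const char *first_missing_target(const struct toy_instr *prog, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t t = 0; t < 2; t++) {
            const char *target = prog[i].targets[t];
            if (target && !label_exists(prog, n, target))
                return target;
        }
    return NULL;
}
```

printing the returned label gives you the error message the checker doesn't:
an "if" with no else block that never emits its else label shows up right
away as a missing target.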

also make sure that when visit_stmt handles a return statement, you always
(1) call visit_expr on the return value's expression and (2) recurse on the
remaining statements in the range (after the return statement). for this
you'll have to use the "find" method in utils.c to find the semicolon.

extension note: Tina got chatgpt to filter true vs. false positives. I think
she's going to post about that later

extension idea: I was talking to Manya about this; note you can pretty
reasonably turn our compiler into an "actual" compiler that spits out, say,
(bad) x86 assembly or Python or something (or write your own IR!). if you want
to keep the basic structure we're using, you pretty much need to extend the
struct meta to also include "what register should I put the output in?" --- lmk
if you're interested in this, I can give some pointers.

---------

modify helper_scripts/lexer_tests.c to print out all the lexemes from
LEXEMES[0] up to LEXEMES[N_LEXEMES - 1] before you do any assertions. That
should quickly help narrow down what rules you're implementing wrong, at least
for the simple test cases in lexer_tests.c.
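
a sketch of such a dump helper. the real type of LEXEMES isn't shown in these
notes, so this assumes it can be viewed as an array of C strings;
`dump_lexemes` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Prints lexemes[0..n) one per line, with index and quotes so that empty
   or whitespace-only lexemes are visible. Returns how many were printed.
   Assumes each lexeme is a NUL-terminated string (an assumption about the
   lab's representation). */
static size_t dump_lexemes(FILE *out, const char *const *lexemes, size_t n) {
    assert(out != NULL);
    for (size_t i = 0; i < n; i++)
        fprintf(out, "[%zu] \"%s\"\n", i, lexemes[i]);
    return n;
}
```

calling something like `dump_lexemes(stderr, LEXEMES, N_LEXEMES)` right before
the first assertion means you still get the dump when an assertion aborts the
test.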

for the lexer test on the big file: it should be pretty clear from the diff
what the problem is; here again, delete lines & characters until you have a
tiny minimal example that triggers the bug in your lexer, then
mentally/on-paper step through the lexer's operation on that file until you
notice the bug.

and for writing the compiler in the first place:

always first convert to a "goto program" that's halfway between C and our IR.
for example, a general if/else pattern in C looks like:

    if (cond) then_body; else else_body;

you might write it as a "goto program" like so:

    branch <cond> then_label else_label;
    then_label:
    <then_body>
    goto exit_label
    else_label:
    <else_body>
    exit_label:

you can then translate "goto programs" pretty much line-by-line into code you
can stick into the compiler (note you need to construct metas correctly):

    visit_expr(<cond>, {.true = then_label, .false = else_label})
    nop_labeled(then_label)
    visit_stmt(<then_body>)
    goto_(exit_label)
    nop_labeled(else_label)
    visit_stmt(<else_body>)
    nop_labeled(exit_label)

you should be able to handle pretty much every statement compilation rule like
this: always write out the goto program first, and then translate *that*
line-by-line into calls to visit_expr, visit_stmt, goto_, and nop_labeled.
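
as a second worked example, here is the same recipe applied to
`while (cond) body;`. the goto program is: head label, branch on cond to the
body label or the exit label, body, goto head, exit label. the sketch below
emits that program as text into a toy buffer. the textual IR syntax,
`toy_out`, `emit`, and `compile_while` are all made up for illustration; in
the real compiler each emit line would become a call to nop_labeled,
visit_expr, visit_stmt, or goto_.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Toy output sink standing in for the compiler's IR output. */
struct toy_out {
    char buf[512];
    size_t len;
};

/* Appends one formatted chunk of text to the buffer. */
static void emit(struct toy_out *out, const char *fmt, ...) {
    size_t room = sizeof out->buf - out->len;
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(out->buf + out->len, room, fmt, ap);
    va_end(ap);
    assert(n >= 0 && (size_t)n < room); /* toy buffer must not overflow */
    out->len += (size_t)n;
}

/* Emits the goto program for `while (cond) body;`.
   `id` makes the labels unique per loop. */
static void compile_while(struct toy_out *out, int id,
                          const char *cond, const char *body) {
    emit(out, "head_%d:\n", id);                             /* nop_labeled(head) */
    emit(out, "branch %s body_%d exit_%d\n", cond, id, id);  /* visit_expr(cond, ...) */
    emit(out, "body_%d:\n", id);                             /* nop_labeled(body) */
    emit(out, "%s\n", body);                                 /* visit_stmt(body) */
    emit(out, "goto head_%d\n", id);                         /* goto_(head) */
    emit(out, "exit_%d:\n", id);                             /* nop_labeled(exit) */
}
```

note that every label that appears as a branch or goto target (head, body,
exit) is emitted unconditionally, so no target can go missing.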