Hello, this post is part of an ongoing personal learning series on Large Language Models (LLMs) and automated reverse engineering. In a previous blog post, I described a type of complexity attack against LLMs. I am using "attack" in the practical reverse-engineering sense: intentionally increasing the amount of interdependent code and state the model has to reason about. My hypothesis is that increasing the number of interdependent round functions increases the chance that a model will make a small but fatal translation error, even when it identifies the correct high-level algorithm. Here is an excerpt from that blog post describing the methodology I'm using.

The complexity attack increases computational complexity by generating binaries with a large number of interdependent functions. Instead of hiding the logic, the goal is to make the amount of state and code too large for practical static reasoning. The executable contains a toy XOR cipher with a keystream derived from a set of N functions, where N is the number of generated rounds. A Python script generates C source code with an embedded encrypted string and decryption loop. GCC is then used to compile the C source into an executable. At runtime, the decrypted string is printed to the console. To make this concrete, we can walk through generating the code and compiling it.
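To give a feel for the scheme described above, here is a minimal Python sketch of a toy XOR cipher whose keystream depends on N round functions. It is not the generator's real output: the round bodies, the per-round constant, the seed handling, and the 13/17/5 xorshift32 shifts are all assumptions chosen for illustration. Only the names derive_state and xorshift32 come from the fixtures.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(x):
    """One step of the classic 13/17/5 xorshift32 (shift constants are assumed)."""
    x &= MASK32
    x ^= (x << 13) & MASK32
    x ^= x >> 17
    x ^= (x << 5) & MASK32
    return x


def make_round(i):
    """Hypothetical round function: mixes a per-round constant into the state."""
    const = (0x9E3779B9 * (i + 1)) & MASK32

    def round_fn(state):
        return ((state ^ const) + i) & MASK32

    return round_fn


def derive_state(seed, rounds):
    """Run every round in order; skipping or reordering one changes the key."""
    state = seed & MASK32
    for fn in rounds:
        state = fn(state)
    return xorshift32(state)


def xor_crypt(data, seed, n_rounds):
    """XOR each byte with a keystream byte; the same call encrypts and decrypts."""
    rounds = [make_round(i) for i in range(n_rounds)]
    state = derive_state(seed, rounds)
    out = bytearray()
    for b in data:
        state = xorshift32(state)
        out.append(b ^ (state & 0xFF))
    return bytes(out)
```

Because the keystream depends on every round, a single mistranslated round yields garbage instead of the plaintext, which is the property the fixtures rely on.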

What is useful about this approach is that I can measure the complexity of the generated binaries and then test them against an LLM. At first glance, this sounds easy, but there are a lot of nuances to running models locally. For example, one model started to fail after I updated Ollama. In this blog post, I'm going to describe the results, the recurring failure themes, and the lessons learned. Before diving into those, I think it's worth mentioning the hardware and tooling.

I'm running all of the local models on a DGX Spark and using OpenAI's Codex as an agent. On the Spark, I use Python and Bash for all of the scripting and tool execution, clearbluejar's pyghidra-mcp within Docker to interact with Ghidra, and Ollama to interact with models. I'm using the following models for evaluation:

gemma4:31b
gemma4:26b
qwen3.6:35b
qwen3.5:35b
qwen3-coder:30b
mistral-small3.2:24b
command-r:35b
cogito:32b
lfm2:24b
hermes3:8b

I attempted to download and use a number of other models, but they did not work because they were not accessible via MCP, timed out, or had similar issues. Before digging into the evaluation process and data, I want to share the results of the most recent evaluation. In these runs, gemma4:31b is clearly the best local baseline. gemma4:26b was historically successful, but it became unstable after I updated Ollama. qwen3.6:35b was able to solve fixtures (i.e., binary test cases) 0001 and 0002 after I updated the prompt. Below is a table of the results from 102 runs. Please note that the results improved over time with updates to the prompt, so this should be read as a practical lab notebook rather than a perfectly controlled benchmark.

| model | rows | verified | completed false | timeouts | other inconclusive | successful fixtures |
|---|---|---|---|---|---|---|
| `gemma4:31b` | 15 | 11 | 2 | 2 | 0 | `0001`, `0002`, `0003`, `0004` across later runs |
| `gemma4:26b` | 23 | 10 | 8 | 3 | 2 | mostly `0001`, plus `0002` and `0003` |
| `qwen3.6:35b` | 11 | 2 | 6 | 1 | 2 | `0001`, `0002` |
| `qwen3-coder:30b` | 9 | 0 | 8 | 0 | 1 | none |
| `qwen3.5:35b` | 4 | 0 | 2 | 1 | 1 | none |
| `cogito:32b` | 9 | 0 | 8 | 0 | 1 | none |
| `command-r:35b` | 9 | 0 | 8 | 0 | 1 | none |
| `mistral-small3.2:24b` | 9 | 0 | 8 | 0 | 1 | none |
| `hermes3:8b` | 9 | 0 | 8 | 0 | 1 | none |
| `lfm2:24b` | 4 | 0 | 4 | 0 | 0 | none |

The following table is from the last run in which gemma4:31b was able to successfully write a decryptor for binaries with 1, 2, and 3 rounds of XOR functions.

| model | fixture | rounds | final grade | completed | decryptor correct | py blocks | status | elapsed sec | MCP calls | error |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma4:31b | /fx_r0001_sl0016_sp0000_dwarf.exe-2f588b | 0001 | true | True | True | 1 | completed | 958.73 | 8 | |
| gemma4:31b | /fx_r0002_sl0016_sp0000_dwarf.exe-861fc4 | 0002 | true | True | True | 1 | completed | 1101.385 | 11 | |
| gemma4:31b | /fx_r0003_sl0016_sp0000_dwarf.exe-135f91 | 0003 | true | True | True | 1 | completed | 918.604 | 9 | |
| gemma4:31b | /fx_r0004_sl0016_sp0000_dwarf.exe-9f4750 | 0004 | inconclusive_timeout | False | False | 0 | ollama_timeout | 2287.903 | 10 | TimeoutError: timed out |
| qwen3.6:35b | /fx_r0001_sl0016_sp0000_dwarf.exe-2f588b | 0001 | true | True | True | 2 | completed | 414.498 | 23 | |
| qwen3.6:35b | /fx_r0002_sl0016_sp0000_dwarf.exe-861fc4 | 0002 | true | True | True | 4 | completed | 281.356 | 13 | |
| qwen3.6:35b | /fx_r0003_sl0016_sp0000_dwarf.exe-135f91 | 0003 | false | True | False | 1 | completed | 304.465 | 29 | |

The analysis of the 4-round binary timed out after 1800 seconds (30 minutes). Previous runs were able to decrypt 4-round binaries, but not 5-round binaries. There is evidence in this setup that the models start to degrade as the binary becomes more complex, but the results are still inconclusive.
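As a rough illustration of why timeouts are counted separately from wrong answers, here is a minimal sketch that tallies the rows of the table above into the same buckets used in the summary table. The mapping from the "final grade" value to a bucket is my assumption about how the two tables relate; the function names are hypothetical and not part of the harness.

```python
from collections import Counter, defaultdict

# (model, rounds, final grade) copied from the per-fixture table above.
ROWS = [
    ("gemma4:31b", 1, "true"),
    ("gemma4:31b", 2, "true"),
    ("gemma4:31b", 3, "true"),
    ("gemma4:31b", 4, "inconclusive_timeout"),
    ("qwen3.6:35b", 1, "true"),
    ("qwen3.6:35b", 2, "true"),
    ("qwen3.6:35b", 3, "false"),
]


def bucket(final_grade):
    """Map a final grade to a summary column (assumed mapping)."""
    if final_grade == "true":
        return "verified"
    if final_grade == "false":
        return "completed false"
    if final_grade == "inconclusive_timeout":
        return "timeout"
    return "other inconclusive"


def tally(rows):
    """Count outcomes per model without folding timeouts into failures."""
    counts = defaultdict(Counter)
    for model, _rounds, grade in rows:
        counts[model][bucket(grade)] += 1
    return counts
```

Keeping "timeout" as its own bucket means a slow run is never recorded as evidence that the model produced a wrong decryptor.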

Failures

Disclaimer: The failure report was generated using AI. I personally find the failures fascinating.

The failures cluster around a small set of reverse-engineering translation hazards:

uint32_t + uint32_t wrapping before a later uint64_t cast
C unsigned literal widths, especially constants ending in u
C cast timing, such as (uint64_t)(expr) where expr was already evaluated at 32 bits
CONCAT44(a,b) high/low argument interpretation
C operator precedence involving +, ^, |, <<, and >>
64-bit rotate idioms such as (x << 21) | (x >> 43)
final derive_state accumulator behavior and final xorshift32(...)
preserving byte order from mixed character/hex array initializers
implementing every generated round function in order
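A few of the hazards listed above can be made concrete with small Python helpers. This is a sketch under stated assumptions, not code taken from the fixtures: the helper names are hypothetical, and the only things I treat as given are the rotate idiom quoted above and Ghidra's convention that the first argument of CONCAT44 is the high half.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def rotl64(x, r):
    """64-bit rotate left, i.e. C's (x << r) | (x >> (64 - r)) on a uint64_t.

    Python ints do not wrap, so the result has to be masked back to 64 bits.
    """
    x &= MASK64
    return ((x << r) | (x >> (64 - r))) & MASK64


def concat44(high, low):
    """Ghidra's CONCAT44(a, b): a supplies the high 32 bits, b the low 32 bits."""
    return ((high & MASK32) << 32) | (low & MASK32)


def bytes_from_initializer(items):
    """Turn a mixed initializer such as {'H', 0x1f, 'i', 0x00} into bytes.

    Characters and hex values are kept in source order; nothing is sorted
    or regrouped by type.
    """
    return bytes(ord(i) if isinstance(i, str) else i for i in items)


# Precedence reminder: in both C and Python, + binds tighter than << and >>,
# which bind tighter than &, then ^, then |. So a ^ b + c means a ^ (b + c).
```

With these helpers, the idiom (x << 21) | (x >> 43) becomes rotl64(x, 21).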

The most interesting lesson so far is that the models often do the hard-looking part first. They find the right functions, recover the encrypted bytes, and understand the XOR loop. The thing that breaks them is often much smaller: one C integer-width rule, one cast at the wrong time, or one generated round translated almost-but-not-quite correctly.

1. The Hardest Failures Are Now Translation Errors, Not Tool Access

The strongest models usually find the right region of the binary: main, derive_state, xorshift32, encrypted bytes, and the generated round functions.

The most important remaining failure class is translating C/Ghidra semantics into Python exactly. This shows up as scripts that look plausible and run successfully but print non-plaintext bytes.

Note: identifiers of the form 20260611-** (for example, 20260611-aa) are run IDs.

Confirmed examples:

gemma4:31b, 20260611-aa, fixture 0004
- The model recovered the right functions and wrote a full decryptor.
- Failure was in R3: it translated (uint64_t)(0xc8e57b40u + m) as a wide Python addition instead of u64(u32(0xc8e57b40 + m)).
- One-line fix made the script print Hello, World.
gemma4:31b, 20260611-cc, fixture 0005
- Same root cause in R3.
- Correct C: p->b ^= (uint64_t)(0xc8e57b40u + m);
- Model Python missed the 32-bit wrap before widening.
- One-line fix made the script print Hello, World.

This is best classified as incorrect-reasoning with a bad-type-recovery or cast-timing secondary cause.

2. Prompt Updates Improved `gemma4:31b`, but Did Not Fully Solve Cast Timing

gemma4:31b improved materially after prompt changes. In 20260611-cc, it solved fixtures 0001 through 0004 and failed on 0005. Earlier, in 20260611-aa, it solved 0001 through 0003 and failed 0004.

The remaining 0005 failure shows that telling the model to use ctypes is not enough. The model imported or mentioned fixed-width behavior but still used raw Python arithmetic for a critical intermediate expression.

The prompt now needs to force helper usage, not just mention ctypes:

import ctypes

# Truncate a Python int to the value a C uint32_t / uint64_t would hold.
def u32(x): return ctypes.c_uint32(x).value
def u64(x): return ctypes.c_uint64(x).value

Critical translation rule:

(uint64_t)(a_uint32 + b_uint32) -> u64(u32(a + b))
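The rule above can be checked with a short worked example. The constant 0xc8e57b40 comes from the R3 failure described earlier; the value of m below is a hypothetical input picked so that the 32-bit addition overflows, and the two function names are mine. The sketch assumes m is a uint32_t in the C source.

```python
import ctypes


def u32(x):
    return ctypes.c_uint32(x).value


def u64(x):
    return ctypes.c_uint64(x).value


def r3_term_wrong(m):
    # Python ints never wrap, so the carry out of bit 31 survives the widening.
    return u64(0xc8e57b40 + m)


def r3_term_right(m):
    # C adds the two 32-bit unsigned values first, wraps, and only then widens.
    return u64(u32(0xc8e57b40 + m))


m = 0x40000000  # hypothetical value that forces a 32-bit overflow
print(hex(r3_term_wrong(m)))  # 0x108e57b40
print(hex(r3_term_right(m)))  # 0x8e57b40
```

The two versions agree whenever the sum fits in 32 bits, which is why a script with this bug can look correct on some inputs and still print the wrong bytes on others.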

This is the kind of bug that makes the benchmark useful to me. The model is not completely lost, but it is also not correct. That middle zone is where a lot of reverse-engineering automation gets interesting.

3. `gemma4:31b` Is the Current Best Local Baseline

gemma4:31b has the strongest completed results:

20260611-aa: solved 0001, 0002, 0003; failed 0004.
20260611-cc: solved 0001, 0002, 0003, 0004; failed 0005.
20260611-dd: solved 0001, 0002, 0003; timed out on 0004.

It is slow, but its failures are now narrow and mechanically diagnosable.

4. `gemma4:26b` Is Historically Useful but Unstable

gemma4:26b has repeated successes, mostly on fixture 0001, and solved 0001 through 0003 in 20260611-bb.

However, it also regressed repeatedly:

wrong decryptor output on fixture 0001 in some runs
timeouts on early fixtures
empty or pseudo-tool response on fixture 0002
inconsistent ability to proceed past fixture 0001

It remains useful as a historical baseline, but it is not as reliable as gemma4:31b.

5. `qwen3.6:35b` Became Interesting in Later Runs

Earlier qwen3.6:35b runs mostly produced invalid or non-printing Python, timed out, or failed to converge.

In 20260611-dd, it solved fixtures 0001 and 0002, then failed fixture 0003 with invalid/malformed Python. That suggests it is not merely MCP-compatible; it can solve the simpler generated fixtures under some settings. It still degrades as round complexity increases.

6. Many Models Are Tool-Compatible but Do Not Converge

Several models can call MCP tools but fail to produce a usable decryptor:

command-r:35b
lfm2:24b
some qwen3.6:35b and qwen3.5:35b runs

These failures usually are not MCP server failures. They tend to fall into one of these categories:

tool use without a final algorithm
excessive tool loops
target drift
loss of the objective after accumulating tool output

lfm2:24b is the clearest example: it used many MCP calls in some runs but did not produce a Python decryptor.

7. Smaller Models Often Stop Too Early

The most common pattern for mistral-small3.2:24b and hermes3:8b is minimal MCP usage followed by no decryptor.

These are mostly incomplete-xrefs failures:

did not inspect enough of main
did not inspect derive_state
did not inspect all generated round functions
did not recover the encrypted bytes and loop bounds
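One way to spot this pattern mechanically is to compare the functions a model inspected against the functions it needed to inspect. The sketch below is only an illustration: the function name is hypothetical, and the round names R1..RN (1-indexed) are an assumption about how the generated rounds are labeled. It also covers functions only, not the encrypted bytes or loop bounds.

```python
def missing_functions(inspected, n_rounds):
    """Return required function names that never appear in the inspected list."""
    required = ["main", "derive_state", "xorshift32"]
    # Assumed naming scheme for the generated round functions.
    required += [f"R{i}" for i in range(1, n_rounds + 1)]
    seen = set(inspected)
    return [name for name in required if name not in seen]


# A model that only looked at main on a 2-round fixture.
print(missing_functions(["main"], 2))
# ['derive_state', 'xorshift32', 'R1', 'R2']
```

An empty result does not prove the model understood the code; it only rules out the "stopped too early" explanation.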

8. Invalid Python and Fence Extraction Remain Separate Problems

Some failures are model output quality problems:

invalid Python syntax
Markdown/prose inside extracted code
several fenced blocks, none of which are a clean decryptor
claimed plaintext inconsistent with script output
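The first two items can be caught before a script is ever executed. Here is a minimal sketch using Python's standard ast module; the function name is hypothetical and this is not necessarily how the harness does it.

```python
import ast


def is_valid_python(source):
    """Return True if the text parses as Python, False otherwise."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes.
        return False
    return True
```

This is only a syntax gate. A single word of prose parses as a valid Python expression, and a script that parses cleanly can still print the wrong bytes, so the output still has to be compared against the expected plaintext.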

There is also a harness extraction issue, observed in 20260611-cc for gemma4:31b on fixture 0005: the saved block_0.py was not the explicitly tagged Python block. Instead, it contained the Markdown around the recovered-output prose, because earlier fences tagged c confused the plain-fence extractor. The actual Python block in answer.md ran, but it printed the wrong bytes until the R3 cast-timing bug was fixed.

This means two separate checks are needed:

Did the model write a correct Python decryptor?
Did the extraction harness capture the intended Python block?
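For the second check, one option is to pair fences in order and select blocks by their language tag instead of grabbing whatever sits between plain fences. The following is a sketch of that idea, not the harness's actual extractor: the names are hypothetical, and it assumes the answer's fences are balanced and start at the beginning of a line.

```python
import re

# Built indirectly so this sketch can itself sit inside a fenced block.
FENCE = "`" * 3

FENCE_RE = re.compile(
    rf"^{FENCE}([A-Za-z0-9_+-]*)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def fenced_blocks(markdown):
    """Return (language_tag, body) for each fenced block, in document order."""
    return [(m.group(1).lower(), m.group(2)) for m in FENCE_RE.finditer(markdown)]


def python_blocks(markdown):
    """Keep only blocks explicitly tagged as Python."""
    return [body for lang, body in fenced_blocks(markdown) if lang in ("python", "py")]
```

Because every opening fence is matched with its own closing fence, a block tagged c earlier in the answer is consumed as a unit and cannot shift the extractor onto the prose that follows it.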

Back to non-AI-ish text.

Limitations

There are a few caveats worth calling out before the conclusion. The prompt changed during the study, and some of the later improvements came from manually comparing generated Python against the original C and feeding that back into the prompt. Ollama updates may also have changed model behavior, especially for gemma4:26b. There was also at least one harness extraction issue where the saved Python block was not the intended final decryptor. Finally, timeouts are inconclusive. They show that the run did not finish inside the configured timeout, not that the model could never solve the fixture.

That means the results are best read as evidence from one local setup, not a universal ranking of models or a final statement about LLM reverse-engineering ability.

Conclusion

More rounds do not make solving impossible, but they substantially increase the chance of failure once a model must preserve a longer state mutation chain.

The dominant complexity effect is not discovering the high-level XOR scheme. Models often find:

main
encrypted bytes
seed
derive_state
xorshift32
generated round functions

The failure emerges when translating every round exactly:

more generated functions to implement
more mutable state updates to preserve in order
more C integer width boundaries
more unsigned literal/cast timing traps
more rotate and precedence opportunities for one-bit state errors
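A toy model helps explain why the failure rate grows with the round count. Assume, purely for illustration, that each round is translated correctly with the same probability and that errors in different rounds are independent. Neither assumption was measured in these runs, and the function name is hypothetical.

```python
def chain_success_probability(per_round_success, n_rounds):
    """Probability that every one of n_rounds is translated correctly.

    Assumes independent rounds with an identical success probability, which is
    a simplification and not a measured property of any model.
    """
    if not 0.0 <= per_round_success <= 1.0:
        raise ValueError("per_round_success must be between 0 and 1")
    return per_round_success ** n_rounds
```

Under this model, success decays geometrically with the number of rounds, so even a model that rarely makes a mistake on any single round will eventually hit a round count where a fully correct decryptor becomes unlikely.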

For this fixture generator, prompt, harness, and local model setup, the practical threshold appears to be:

rounds 1-3: solvable by gemma4:31b with good reliability
round 4: boundary where failures/timeouts start
round 5: current failure point for gemma4:31b

That conclusion should be treated as provisional because there are few high-round samples.

Next Steps

Automated prompt improvement using failure analysis
- This process was done manually by comparing the generated Python code against the original C code, then having the harness recommend upgrades to the prompt.
- This substantially improved the results. qwen3.6 started working after the first iteration of this approach.
Compare against frontier models.
Keep learning and keep reading.

Any recommendations? I would love to hear feedback, hints, or thoughts from others. Feel free to send an email to Alexander dot Hanel at gmail dot com or leave a comment. Cheers.

Hooked on Mnemonics Worked for Me

Stressing LLMs - Local Model Complexity Attacks Progress
