Attention-Only vs. All-Layer LoRA: A Worklog

This is the worklog version of the LoRA and Friends experiment.

The final public artifacts are:

1. Before The Run

2026-04-19: the first plan had too many moving parts

I started with too many possible experiments in the air: full fine-tuning, MoE, RL, transfer evaluation, Tinker defaults, LoRA target modules, dataset choice, benchmark choice, and budget. That was not a runnable project but a pile of directions.

I removed most of it. Phase one became:

I kept Qwen3-8B as the working model. Qwen3-8B-Base stayed in reserve for a different post-training story, and a smaller future model would have added a model-choice variable that I did not want in the first comparison.

The target-module comparison was the part I wanted to isolate:

conditionintended adapter scope
attention_onlyattention projections
all_layerattention and MLP projections

That meant everything else had to become boring on purpose: same base model, same data, same rank, same training schedule, same evaluation prompt, same answer extraction, same benchmark, and same checkpoint-selection rule.

Why GSM8K became the benchmark, not the main training set

GSM8K was tempting as a training source because it is clean and familiar. But the train split has only 7,473 rows, and the test split has 1,319 rows. I wanted the final score to be held out and easy to explain, so I kept GSM8K as the benchmark anchor.

The training source moved to nvidia/OpenMathInstruct-2, especially the train_1M split. The schema was simple enough to render into chat format:

{
	"problem": "Solve for $y$:\n\n$$\\frac{y^2 - 3y + 2}{y - 2} = y + 1$$",
	"generated_solution": "... \\[ y = \\boxed{2} \\]",
	"expected_answer": "2",
	"problem_source": "augmented_math"
}

I also looked at unsloth/OpenMathReasoning-mini. It had more character and much longer traces, but the token profile made it harder to fit into the budget. OpenMathInstruct-2 looked less exciting, but it had the right shape for a controlled SFT comparison.

The first sizing pass made the budget concrete. A 500-row tokenizer sample from OpenMathInstruct-2 averaged 456.9 rendered tokens per example, with p90 = 813 and p95 = 963. At Tinker’s then-current Qwen3-8B training price of $0.40 / M tokens, a 30k-example run started to look feasible.

The first target was a mildly balanced 30k dataset:

30k working raw dataset
- 21,000  augmented_math
-  7,000  augmented_gsm8k
-  1,000  math
-  1,000  gsm8k

The first plan did not survive.

The first run sheet was useful even though it was wrong

The first run sheet had a training split of 27,000 rows and a validation split of 3,000 rows. With the early weighted mean estimate, that came out to about 11.70M train tokens per epoch and about $4.68 per epoch.

The pilot shape was also already there:

pilot sweep
- train rows: 5,000
- val rows:     500
- 2 conditions x 3 LR values x 1 seed x 1 epoch

And the main comparison shape was visible:

thesis comparison
- 2 conditions x 3 seeds x 2 epochs
- total train cost: about $56.17

Those numbers were later superseded, but they were still useful. They forced me to think in runs, seeds, rows, tokens, and dollars instead of treating “fine-tune a model” as one vague action.

The benchmark side looked cheap in comparison. One full GSM8K eval was estimated around cents, not dollars of training cost. So from the beginning, training data and training schedule were the real budget levers.

2. Before Training

2026-04-20: the first dataset looked usable until I read it

The initial openmath_30k artifacts matched the recipe exactly: 27,000 train rows and 3,000 validation rows. Structurally it looked clean. Every inspected training row had a final \boxed{...} answer. The first rendered dataset under qwen3_disable_thinking also looked reasonable.

Then the manual audit started doing its job.

One sampled augmented_gsm8k row reasoned itself into clipping a percentage answer to 100 because a fictional budget was too small. A tetrahedron row noticed a fractional tetrahedron count was impossible and still pushed through to a final ratio. A pattern search found disclaimer-like repair language:

cannot spend more than she has
doesn't align with the logical outcome
we need to set the value to 100

At first I treated this as a light cleanup problem. I removed a couple of flagged rows, built filtered artifacts, and added manual review packs. The mismatch count fell, the obvious bad rows got smaller, and the dataset felt closer.

But each pass exposed another class of issue.

The augmented branch kept asking for another pass

The suspicious rows were concentrated in the augmented sources. The failures were not all the same, but they had a family resemblance: impossible integer counts, prompt-generation residue, unsupported case-bashing, and answers that looked right only because the generated reasoning had quietly changed the problem.

The review packs were small enough to inspect but large enough to change my mind. A 53-row suspicious train pack for curated_v2 produced 29 removals. A later 61-row suspicious train pack produced 30 hard removals, all from augmented_math. These were not tiny formatting mismatches but rows with broken reasoning, contaminated prompts, or invalid proofs.

The curation sequence became longer than I wanted:

One excerpt from the log captures the real issue:

The problem was no longer "find one more regex,"
it was "stop trusting augmented_math as the backbone of the dataset."

At that point I stopped trying to rescue the 30k augmented-heavy recipe. If the dataset takes repeated repair passes and still produces obvious bad rows in random review, then it stops being a clean experimental input and becomes the experiment itself.

I did not want the final write-up to be “attention-only LoRA versus all-layer LoRA, but also maybe the augmented data had hidden garbage.” That would have made every downstream result harder to trust.

2026-04-20 to 2026-04-21: I rebuilt around original rows only

OpenMathInstruct-2 train_1M had 29,468 original-source rows across gsm8k and math. After the strict gate, 28,166 survived cleanly enough to build a new candidate.

The accepted source counts were:

{
	"gsm8k": 14618,
	"math": 13548
}

The original-only dataset was smaller than the 30k target, but the quality profile was much better. It cleared the automatic checks that the augmented branch kept failing:

I froze this path as openmath_original_clean.

After a later split repair, the final retained manifest became:

splitgsm8k source rowsmath source rowstotal
train13,14512,20325,348
validation1,4731,3452,818

The old openmath_30k* lineage was retired after that. I wanted one clear dataset path in the repo, not a dozen stale artifact branches that future-me or another agent might accidentally treat as active.

2026-04-21: I kept the system prompt

The render sanity check compared the frozen dataset with and without the system prompt under the actual Qwen/Qwen3-8B chat template. The prompt added a constant 27 tokens per example.

{
	"full_mean_with_system": 340.25,
	"full_mean_without_system": 313.25,
	"full_mean_delta": 27.0
}

That overhead was small enough to keep. More importantly, the prompt made the output contract explicit:

You are a careful math solver. Solve the problem step by step. Put the final answer in \boxed{}.

I kept it because the raw problems themselves do not always carry the boxed-answer instruction. I also wanted training and evaluation to mirror each other as closely as possible.

3. Contracts And Sanity Checks

2026-04-22: I stopped relying on memory

Once the dataset path was stable, the next risk was format drift. It is easy to train with one prompt, evaluate with another, parse answers a third way, and then accidentally report a prompt experiment as a model experiment.

So I wrote the contracts down before the final result existed.

I also had to clean up the planning docs. Some TODO labels were too compressed. A label like “contamination check” made sense to me while I was in the middle of the work, but it hid the actual operation: compare training-side gsm8k problem text against held-out GSM8K test questions after canonical normalization, then block training if overlap is greater than zero.

So I added an execution-clarity rule for the project: open tasks needed to say what the action was, why it mattered, what counted as done, and what happened if it failed. It sounds procedural, but it kept the later runbook from turning into shorthand only I could decode.

The evaluation contract froze:

The contamination check was a gate, not a note. If training-side gsm8k rows overlapped the GSM8K test questions, the benchmark had to change or the dataset had to be rebuilt.

The result schema was frozen around retained JSONL output under artifacts/results/<run_id>/. It also kept token_count and cost in the canonical schema even when those values were null, because I did not want the schema to change later just because telemetry became available.

I also standardized on the word condition for the comparison label, which sounds minor but made later artifacts easier to read:

{
	"condition": "attention_only",
	"seed": 0,
	"step": 3169
}

2026-05-03: the dataset split still had a hidden problem

When I ran the dataset-integrity gate, the benchmark side passed. The training-side gsm8k rows had 0 overlap with the openai/gsm8k test split.

But the local train/validation split failed a different check. OpenMath can include multiple accepted solutions for the same underlying problem. The original builder had split rows independently, so variants of the same problem could land on both sides.

That would make validation loss too friendly. It would also make checkpoint or LR decisions look cleaner than they were.

The fix was to group by canonical problem text before splitting. The retained report after the rebuild says:

{
	"status": "pass",
	"overlap_count": 0,
	"train_val_overlap": {
		"row_id_overlap_count": 0,
		"problem_text_overlap_count": 0,
		"overlapped_val_row_count": 0
	}
}

The final contamination report compared 13,145 training rows with source == "gsm8k" against 1,319 GSM8K test questions. It found overlap_count = 0.

The raw files were not enough on their own, and the split rule mattered.

2026-05-04: the dataset went to Hugging Face

After the repaired split, the frozen dataset payloads were published to sumitdotml/lora-and-friends-dataset.

The local repo kept scripts, manifests, checksums, and audit evidence. The Hub repo carried the payloads:

payloadrowsbytes
raw train25,34825,528,287
raw validation2,8182,787,397
rendered train25,34829,989,535
rendered validation2,8183,283,365

At this point I had a real dataset, a render path, an evaluation contract, and a contamination gate. I still did not know whether the training path would work.

4. First Tinker Check

2026-05-04: the first backend check caught a renderer trap

The first Tinker smoke pass was supposed to be boring. It was one of the smallest checks in the whole project: can Tinker accept Qwen/Qwen3-8B, rank 8, and the two LoRA condition switches?

The Tinker cookbook qwen3_disable_thinking renderer was correct for generation prompts, but its supervised-training path did not match the frozen SFT render contract. The project contract was the Hugging Face chat template with enable_thinking=False, which renders an empty thinking block before the assistant answer:

<think>

</think>

I changed the smoke runner so Tinker datums came from AutoTokenizer.apply_chat_template(..., enable_thinking=False), then masked loss only after the rendered prompt prefix.

The retained smoke pass then gave me enough to lock the adapter defaults:

conditiontrain rows in smokeoptimizer stepsvalidation NLL
attention_only111.5048651695251465
all_layer811.4733978509902954

Tinker accepted:

{
	"attention_only": {
		"train_attn": true,
		"train_mlp": false,
		"train_unembed": false
	},
	"all_layer": {
		"train_attn": true,
		"train_mlp": true,
		"train_unembed": false
	}
}

The public Tinker API did not expose local lora_alpha or lora_dropout fields through this training path. I recorded those as backend-owned instead of pretending the runner controlled them.

There was one more small backend detail in the log: the first smoke attempt printed that the Tinker SDK version was outdated. I upgraded to tinker==0.18.2, reran the smoke pass, and the warning did not reappear. Backend version drift becomes hard to reconstruct later, so I kept the note.

The smoke pass also created checkpoint paths with a seven-day TTL. That did not matter for the final result yet, but it foreshadowed the export urgency after the main sweep.

2026-05-05: I ran the untouched baseline before touching LoRA

The baseline was Qwen/Qwen3-8B on openai/gsm8k, config main, split test, with enable_thinking=false, temperature=0, and max_new_tokens=512.

The retained baseline run was:

artifacts/results/baseline-qwen3-8b-gsm8k-001/

It scored:

I wanted this number before LR selection and before the main comparison. Otherwise, the final LoRA scores would float without a reference.

5. Learning Rate And Batch Shape

2026-05-06: the LR-selection plan was too slow

The first LR-selection protocol used 5,000 train rows and 500 validation rows. On paper it was reasonable. In practice it projected to roughly 20 hours for the six-run sweep on Tinker.

I rescaled it to a deterministic slice:

The narrow purpose was to choose a peak LR per condition, not to run a final benchmark or a condition comparison.

The selected LR was 3e-4 for both:

conditionLRbest validation NLL
attention_only1e-40.3786645046540731
attention_only3e-40.3632619345728878
attention_only1e-30.3644437038722405
all_layer1e-40.3648794147648033
all_layer3e-40.3559855057286731
all_layer1e-30.37397296784836664

I kept seed 7 reserved for selection work so the main comparison seeds could be 0, 1, and 2.

The LR-selection runner also caught a plain implementation bug before the expensive runs. I had assumed Tinker cookbook datum weights were raw torch tensors. They were TensorData objects. The live one-step probe failed, the runner got fixed, and that mistake went into AGENT_MISTAKES.md. The final result depends on the boring part where a runner can survive a one-step live probe before it is trusted for a sweep.

I also split the training helper code into smaller modules:

The runnable scripts stayed separate: one for the smoke pass, one for LR selection, and later one for the main training sweep. This kept the orchestration scripts readable enough that I could still inspect what was being frozen.

2026-05-07: speed was tempting, but batch size was not free

The initial runner shape used eight one-datum train requests before each optimizer step. A throughput probe showed that request shape mattered a lot at the same effective batch size:

Throughput by request shape

request shapeseconds per optimizer step
single_datum_calls20.22187466151081
batched_datums5.201307859155349
batched_datums_pipelined2.4104866901249693

The fast version, batched_datums_pipelined, submits one batched forward_backward_async(...) request and queues the optimizer request before waiting. That kept the nominal effective batch size at 8; it only changed how the same batch was sent to Tinker.

Then larger batches looked tempting. Batch 512 and 1024 were much faster in throughput probes. But batch size is a real hyperparameter for LoRA, so I made the larger-batch path earn its way in by validation NLL, and it did not.

conditionselected batch/LRselected NLLbest larger-batch candidatecandidate NLL
attention_onlybatch 8, 3e-40.3632619345728878batch 512, 1e-30.37616809419132946
all_layerbatch 8, 3e-40.3559855057286731batch 512, 1e-30.3569802998485914

So the main run kept effective batch size 8, used the pipelined request shape, and kept the final partial batch of 4 rows rather than wrapping around and duplicating examples. The fast path was real, but it was only allowed to change operations, not silently change the scientific comparison.

6. The Six Runs

2026-05-08: main-001 started

The main run expanded to six sequential condition/seed runs:

The launch command was:

uv run training/run_main_training.py --run-prefix main-001

Each run used:

The validation/checkpoint cadence was 1000, 2000, 3169, 4000, 5000, 6000, and 6338.

I monitored this like a long-running job, because it was one. The raw log has many repeated entries that look boring at first glance:

validation step=1000 nll=...
validation step=2000 nll=...
validation step=3169 nll=...
...

Those entries showed that the run was still alive, that validation/checkpoint blocks resumed correctly, and that every retained checkpoint path was read from the active run rather than guessed from a sibling run.

I had already made that mistake once while logging a checkpoint URL by analogy from a sibling run. The fix was simple: read the exact checkpoint string from the active run’s metrics.jsonl or summary before writing it down. After that, I treated checkpoint paths as evidence, not as strings I could reconstruct from memory.

Step 3169 kept showing up

The first attention-only seed hit:

main-001-attention_only-seed-0: validation step=3169 nll=0.336164

Its final summary selected step 3169, not the final step 6338:

selected best validation step: 3169
selected validation_mean_nll: 0.33616363178874076

Then seed 1 selected step 3169, then seed 2, then all-layer seed 0, then all-layer seed 1, then all-layer seed 2.

At the end, every run selected the one-epoch boundary:

runselected stepvalidation NLL
main-001-attention_only-seed-031690.33616363178874076
main-001-attention_only-seed-131690.33621202263413363
main-001-attention_only-seed-231690.33651007850933906
main-001-all_layer-seed-031690.33611938013674597
main-001-all_layer-seed-131690.33655583715915666
main-001-all_layer-seed-231690.33634131648081267

Validation NLL by training step

The selected NLL means were close:

But the post-3169 behavior was different. Attention-only stayed near the selected NLL. All-layer jumped more sharply after the one-epoch boundary; at step 4000, all-layer mean NLL was 0.3457733619534809.

I did not treat that as the final result though; it was a training diagnostic, and the actual comparison still had to come from GSM8K.

For all-layer seed 1, the NLL moved from 0.33655583715915666 at step 3169 to 0.34549824556309505 at step 4000. For all-layer seed 2, it moved from 0.33634131648081267 to 0.3461580484910246.

The lowest-NLL checkpoint rule prevented later checkpoints from becoming attractive simply because they were later or because the run had continued longer.

The sweep finished with six pass summaries

The final all-layer seed completed on 2026-05-09. The log entry was satisfying because the sweep had finally become a finite object:

completion artifacts now exist in artifacts/results/main-001-all_layer-seed-2/:
manifest.json, metrics.jsonl, sample_render.txt, summary.json

Each of the six selected checkpoints had a seven-day Tinker TTL, so the next step was urgent enough: export the selected checkpoint states into something that could be evaluated and published.

7. Checkpoint Export

2026-05-09: training weights were not sampler weights

The selected training checkpoints used weights/... paths. I initially wanted to use those directly for publication/evaluation, but the sampling client rejected that shape.

The actual server response was:

tinker.BadRequestError: Error code: 400 - {'detail': 'model_path must point to a sampler_weights checkpoint, got weights'}

So the selected training state checkpoints had to be converted to sampler-format weights before any evaluation could run.

The working export path was:

load_state_async(training_checkpoint_path)
save_weights_for_sampler_async(export_name)

The six sampler checkpoint URIs then looked like:

tinker://.../sampler_weights/export-main-001-attention_only-seed-0-step-3169
tinker://.../sampler_weights/export-main-001-all_layer-seed-2-step-3169

After that, the converted adapters were uploaded to the Hugging Face model repo. The published layout is:

checkpoints/best-checkpoints/attention_only/seed-{0,1,2}/step-3169/
checkpoints/best-checkpoints/all_layer/seed-{0,1,2}/step-3169/

The sampler weights also made the adapter-size difference concrete. The attention-only sampler exports were about 30.8M bytes each in the Tinker checkpoint listing, while the all-layer sampler exports were about 87.5M bytes each. That was expected from the larger adapter scope, but it was useful to see it in the retained checkpoint metadata rather than only as intuition.

8. GSM8K Evaluation

2026-05-10: the evaluation runner needed checkpoint awareness

The baseline path already existed, but checkpoint evaluation needed more safeguards. I added support for:

I also made the argument validation fail fast if a checkpoint path was passed without condition or seed. I did not want a LoRA checkpoint quietly defaulting to base.

Before launching the full sweep, I ran a one-example probe with attention_only seed 0. It returned 1/1 correct and confirmed the full path:

sampler URI -> tokenizer -> Tinker sample -> boxed-answer extraction -> scoring -> retained artifacts

The six eval runs

The full sweep ran at --concurrency 16, no limit, against the six selected sampler checkpoints.

The baseline stayed:

The six LoRA eval runs were:

conditionseedaccuracycorrect / totalextraction failures
attention_only00.90447308567096291193 / 13196
attention_only10.90674753601213041196 / 13192
attention_only20.90523123578468531194 / 13195
all_layer00.89916603487490521186 / 13195
all_layer10.90219863532979531190 / 13194
all_layer20.90144048521607281189 / 13198

Aggregated by condition:

conditionmean accuracyminmaxrange
attention_only0.90548395248925960.90447308567096290.90674753601213040.0022744503411676
all_layer0.90093505180692450.89916603487490520.90219863532979530.0030326004548901

GSM8K accuracy by condition

The mean gap was 0.0045489006823352, or about 0.455 percentage points, in favor of attention-only. Under the frozen rule, that is below the 0.01 threshold for a winner claim. So the honest reading is local: attention-only ended higher in this run set and used a smaller adapter, but the result stays inside the pre-declared inconclusive band.

The disagreement view made the result easier to read

The net accuracy chart showed the final result, but the disagreement chart explained its shape better.

Per-seed prediction disagreement on GSM8K

For paired seeds:

seedattention-only onlyall-layer onlyboth correctboth wrongdelta
05447113979+7
14943114780+6
24641114884+5

The two conditions mostly agree, with the difference coming from a narrow band of examples where attention-only is correct slightly more often than the reverse.

9. Figures And Cleanup

2026-05-10: the figure pass changed how I told the result

The figure pipeline produced eight retained artifacts. This worklog uses four of them:

The figure pass forced me to separate claims that were getting blurred together:

The fig_04 design took the most iteration. The original idea was a slope/delta view showing net correct-count deltas of +7, +6, and +5. It kept reading like a progression over time or a tiny effect exaggerated by the axis.

The raw log records several rejected forms: vertical slope, leader labels, corner text boxes, horizontal dumbbell, vertical paired columns, and delta-only bars. The recurring problem was that the visual form was louder than the claim. A 0.4 to 0.5 percentage-point gap should not be drawn like a dramatic swing.

The final chart became per-seed disagreement small multiples. That matched the actual question better: where do the two conditions disagree?

The log records the final contingency numbers before implementation:

seed 0: attention_only_only=54, all_layer_only=47, both_correct=1139, both_wrong=79
seed 1: attention_only_only=49, all_layer_only=43, both_correct=1147, both_wrong=80
seed 2: attention_only_only=46, all_layer_only=41, both_correct=1148, both_wrong=84

I kept the filename as fig_04_paired_seed_slope so existing artifact paths did not move, but the chart itself became the disagreement view.

The cleanup rule

The result can say:

The result cannot say:

10. What This Leaves Me With

The final comparison was the easy sentence at the end of a lot of setup.

The result is small, but the path to making it trustworthy was not.

If I were doing the next version, I would keep the same discipline: narrow the question first, freeze contracts before seeing benchmark numbers, treat dataset audits as part of the experiment, and make every convenience change prove that it does not change the comparison. The follow-up question I would carry forward is when a narrower adapter is enough, and when broader adaptation is worth the extra size.