INDEX
Explanations
occurrences of the word "instructions" and its variations
NEGATIVE LOGITS
harem         -0.81
Neve          -0.80
Nema          -0.79
Gaps          -0.78
GAP           -0.77
Nemesis       -0.76
Kuz           -0.76
كومونز        -0.75
Kuz           -0.74
ponses        -0.74
POSITIVE LOGITS
instructions   2.65
Instructions   2.37
instruction    2.33
instructions   2.17
Instructions   2.13
Instruction    2.08
instruct       1.95
instructed     1.95
INSTRUCTIONS   1.93
Instruction    1.91
Activations Density: 0.075%