Explanations
commands and instructions related to enabling or disabling features in software or systems
Negative Logits
-        -0.63
↵        -0.63
/        -0.60
?        -0.56
,        -0.52
(        -0.52
(        -0.52
N        -0.51
l        -0.50
<eos>    -0.50
Positive Logits
disable       2.23
Disable       1.96
disable       1.94
Disable       1.70
disabling     1.62
disables      1.48
deactivate    1.23
DISABLE       1.22
DISABLE       1.20
Enable        1.15
Activations Density: 0.035%
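
The page does not define how the density figure is obtained. A minimal sketch, assuming it is the fraction of sampled tokens on which the feature has a non-zero activation: the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only to reproduce a value of the same size as the one reported.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens where the feature fires.
    This is an illustrative definition, not one confirmed by the source.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 20,000 tokens, of which 7 activate the feature.
sample = [0.0] * 19993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 5.0]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.035%
```

Under this reading, a density of 0.035% means the feature fires on roughly 1 token in every 2,900, which is consistent with a feature tied to a narrow family of tokens such as the "disable" variants listed above.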