INDEX
Explanations
The neuron activates on policy‐style permission and prohibition words (e.g. “allowed,” “prohibited”), flagging tokens that signal what is or isn’t permitted.
Negative Logits
Token           Logit
фай             -0.06
international   -0.06
orama           -0.06
Readers         -0.06
ROY             -0.06
mil             -0.06
                -0.06
分钟            -0.06
WaitForSeconds  -0.06
جهانی           -0.06
Positive Logits
Token           Logit
赤              0.07
icontains       0.07
zaměstn         0.07
umb             0.07
elines          0.06
emap            0.06
.splice         0.06
sufficient      0.06
amplify         0.06
.flatMap        0.06
Activations Density 0.006%
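
As a rough illustration of how a density figure like this could be read, the sketch below treats activation density as the percentage of sampled tokens on which the neuron fires above a threshold. That definition, the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the dashboard does not state how its figure is computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of sampled tokens on which
    the neuron is active; the dashboard itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 6 active tokens out of 100,000 sampled tokens.
sample = [0.0] * 99_994 + [1.2, 0.4, 2.5, 0.9, 3.1, 0.7]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.006% would correspond to the neuron firing on roughly 6 of every 100,000 tokens.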