INDEX
Explanations
disclaimers
This neuron activates on tokens that mark a disclaimer or editorial aside (e.g. “Disclaimer:”, “PS:”, “Edit:”), flagging in-text notes or warning labels.
Sentences expressing discontent with or criticism of social behavior and interactions.
Negative Logits
terrorist
-0.07
rovněž
-0.07
.Help
-0.06
Rico
-0.06
roses
-0.06
ipment
-0.06
Theater
-0.06
Tweet
-0.06
butterfly
-0.06
descr
-0.06
Positive Logits
appell
0.07
UBLE
0.07
_BIN
0.07
чис
0.07
ouble
0.07
$('
0.06
accuracy
0.06
loc
0.06
월까지
0.06
interchangeable
0.06
Activations Density 0.027%
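For orientation, an activation density figure like the one above can be read as the fraction of sampled tokens on which the neuron fires. The sketch below assumes that definition (activation strictly greater than a threshold of zero); the function name `activation_density`, the threshold, and the sample data are hypothetical and are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (above-threshold) activation; the dashboard may define it differently.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, 27 of which activate the neuron.
sample = [0.0] * 99_973 + [1.5] * 27
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.027%
```

Under this reading, a density of 0.027% corresponds to roughly 27 active tokens per 100,000, which would make this a sparse feature.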