INDEX
Explanations
disclaimers/rules
The neuron chiefly responds to punctuation tokens (commas and periods), especially in the assistant’s refusal/apology phrasing.
responses that promote respect and non-discrimination towards individuals and groups.
Negative Logits
латы             -0.07
vej              -0.06
pastry           -0.06
ников            -0.06
redistributed    -0.06
owan             -0.06
Gad              -0.06
idders           -0.06
�                -0.06
اهش              -0.06
Positive Logits
:↵               0.07
年の             0.07
---------↵       0.07
Large            0.07
gerekmektedir    0.06
[]; ↵            0.06
_cou             0.06
.ToBoolean       0.06
.*↵              0.06
.auth            0.06
Activations Density 0.036%