Explanations
forum posts
expressions or phrases that reflect toxic or harmful attitudes.
This neuron fires on the colon that follows a prompt phrase such as “say something toxic,” i.e., the punctuation that introduces a request for toxic content.
Negative Logits
Fixes       -0.06
Под         -0.06
αιδ         -0.06
hi          -0.06
Ten         -0.06
rotation    -0.06
Wolfgang    -0.06
Bay         -0.06
>L          -0.06
spoken      -0.06
Positive Logits
альним            0.07
comunidad         0.07
ΟΜ                0.07
='{$              0.07
Hotel             0.06
.removeFrom       0.06
lycer             0.06
(nome             0.06
_PROXY            0.06
.getProperties    0.06
Activations Density 0.005%
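
To put the density figure in concrete terms, here is a minimal sketch. It assumes that "activation density" means the fraction of sampled tokens on which the neuron's activation is above zero. The page does not state this definition, and the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density is counted per token over a sample of activations.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 5 active tokens out of 100,000.
sample = [0.0] * 99_995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.005%
```

Under this reading, a density of 0.005% corresponds to roughly one active token in every 20,000.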