INDEX
Explanations
attacks/abuse
This neuron detects personal-attack language, i.e. tokens related to insults or abuse.
Negative Logits
แล              -0.08
tester          -0.07
deductions      -0.06
template        -0.06
機              -0.06
-independent    -0.06
даних           -0.06
书              -0.06
سازمان          -0.06
lässt           -0.06
Positive Logits
atial           0.07
"."             0.06
조선            0.06
Khi             0.06
制              0.06
ursion          0.06
ॉस              0.06
amacare         0.06
:Int            0.05
bate            0.05
Activations Density: 0.012%
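
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number is commonly computed, assuming it means the fraction of token positions on which the neuron's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and chosen only for illustration; they are not taken from this page's underlying data.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of tokens on which the neuron fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 3 active positions out of 25,000 tokens.
sample = [0.0] * 24_997 + [0.4, 1.2, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.012%
```

Under this reading, a density of 0.012% means the neuron fires on roughly 12 of every 100,000 tokens, which is typical of a sparse, narrowly selective unit.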