INDEX
Explanations
occurrences of harmful or violent actions
Negative Logits
Stevenson   -0.15
Dank        -0.15
640         -0.14
ledo        -0.14
oppel       -0.14
iba         -0.14
Debe        -0.14
hours       -0.14
æľį         -0.14
odzi        -0.14
Positive Logits
ansom       0.19
sher        0.18
etin        0.15
aiser       0.15
endoza      0.15
-LAST       0.14
nock        0.14
én          0.14
ecome       0.14
bic         0.13
Activations Density 0.048%