INDEX
Explanations
references to violence and serious societal issues
Negative Logits
chin      -0.16
ghosts    -0.14
éric      -0.14
ichel     -0.14
china     -0.14
öt        -0.14
chw       -0.14
kea       -0.14
verage    -0.14
anager    -0.14
Positive Logits
hide      0.43
hor       0.42
hor       0.36
hide      0.34
rep       0.31
Hide      0.30
ab        0.29
Hor       0.29
gh        0.28
sick      0.28
Activations Density 0.380%