INDEX
Explanations
references to torture and suffering
Negative Logits
UnderTest   -0.15
ardon       -0.15
lining      -0.14
ê·Ģ         -0.14
yg          -0.14
ially       -0.14
hiro        -0.14
اØŃÛĮ        -0.14
alties      -0.13
ikat        -0.13
Positive Logits
ofil        0.15
/plain      0.15
Plain       0.14
inct        0.14
oenix       0.14
باØŃ         0.14
ANTI        0.14
lixir       0.13
aylight     0.13
bev         0.13
Activations Density: 0.010%