INDEX
Explanations
negative descriptors and references to moral wrongdoing
Negative Logits
aison      -0.17
èn         -0.16
Abstract   -0.14
breeze     -0.14
Gir        -0.14
fragrance  -0.14
Prest      -0.13
uras       -0.13
quia       -0.13
Fre        -0.13
Positive Logits
mort       0.18
mue        0.15
arius      0.15
ipel       0.14
nop        0.14
_simps     0.14
imei       0.14
PILE       0.14
adu        0.14
lest       0.14
Activations Density: 0.098%