INDEX
Explanations
phrases indicating denial or negation
Negative Logits
EG         -0.16
ants       -0.14
886        -0.14
entic      -0.14
HEL        -0.14
rosso      -0.14
ivi        -0.13
Chamber    -0.13
594        -0.13
alus       -0.13
Positive Logits
erson       0.17
sez         0.15
eker        0.15
Auditor     0.15
ublic       0.15
Mist        0.14
ertz        0.14
.elem       0.14
ernal       0.14
æ²          0.14
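The two lists above are the tokens whose logits this feature most decreases and most increases. The following is a minimal sketch of how such lists could be produced. It assumes the dashboard holds one logit-effect score per vocabulary token; how that score is computed (commonly by projecting the feature's decoder direction through the model's unembedding) is also an assumption here. `top_logit_tokens` is a hypothetical helper, and the sample scores are copied from the tables above.

```python
import heapq


def top_logit_tokens(
    effects: dict[str, float], k: int = 10
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return (most negative, most positive) (token, score) pairs.

    Hypothetical helper: `effects` maps each vocabulary token to the
    feature's logit-effect score (assumed input format).
    """
    # nsmallest/nlargest behave like a stable sort, so tied scores
    # keep their insertion order.
    negative = heapq.nsmallest(k, effects.items(), key=lambda kv: kv[1])
    positive = heapq.nlargest(k, effects.items(), key=lambda kv: kv[1])
    return negative, positive


# Sample scores copied from the dashboard tables (rounded to 2 decimals).
effects = {"EG": -0.16, "ants": -0.14, "erson": 0.17, "sez": 0.15, "Mist": 0.14}
neg, pos = top_logit_tokens(effects, k=2)
print(neg)  # [('EG', -0.16), ('ants', -0.14)]
print(pos)  # [('erson', 0.17), ('sez', 0.15)]
```

Because the displayed scores are rounded, many tokens tie at -0.14 or 0.14. With full-precision scores, the ordering inside each tie would be determined by the underlying values rather than by insertion order.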
Activation Density: 0.006%
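Activation density is read here as the percentage of token positions on which the feature fires. That definition, and the zero firing threshold, are assumptions. Under this reading, 0.006% corresponds to about 6 firing positions per 100,000 tokens, which makes this a very sparse feature. The following minimal sketch uses a hypothetical `activation_density` helper:

```python
import math


def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`.

    Hypothetical helper; the zero threshold is an assumed firing criterion.
    """
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# 6 firing positions out of 100,000 tokens.
acts = [0.0] * 99_994 + [1.0] * 6
density = activation_density(acts)
print(math.isclose(density, 0.006))  # True
```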