Explanations
words related to delusion or deception
Negative Logits
ISON      -0.19
yny       -0.18
ç§        -0.17
wheel     -0.15
zzo       -0.15
yn        -0.15
viên      -0.15
ably      -0.14
lou       -0.14
ellite    -0.14
Positive Logits
del        0.27
uge        0.23
(del       0.23
oit        0.22
ivered     0.22
aware      0.21
.del       0.21
Del        0.21
phin       0.21
iber       0.21
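Ranked token lists like these are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and sorting the per-token logit effects. This page does not state how its lists were computed, so the sketch below only assumes that procedure. `decoder_direction`, `W_U`, and `vocab` are hypothetical names, and the usage values are toy numbers rather than data from this feature.

```python
import numpy as np


def top_logit_effects(decoder_direction, W_U, vocab, k=10):
    """Rank tokens by how strongly a feature's decoder direction raises or lowers their logits.

    Assumed shapes (hypothetical): decoder_direction is (d_model,),
    W_U is (d_model, n_vocab), and vocab is a list of n_vocab token strings.
    """
    # Direct effect of the feature on every token's logit: shape (n_vocab,)
    logit_effects = np.asarray(decoder_direction) @ np.asarray(W_U)
    order = np.argsort(logit_effects)  # ascending: most suppressed tokens first
    negative = [(vocab[i], float(logit_effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(logit_effects[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy usage: a 2-dimensional residual stream and a 4-token vocabulary
W_U = np.array([[1.0, 0.0, -1.0, 0.5],
                [0.0, 1.0, 0.0, -1.0]])
neg, pos = top_logit_effects(np.array([1.0, 0.2]), W_U, ["a", "b", "c", "d"], k=2)
print(neg)  # most negative effects first
print(pos)  # most positive effects first
```

Sorting once and slicing from both ends yields the "Negative Logits" and "Positive Logits" lists in a single pass.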
Activation Density: 0.019%
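Activation density is usually the fraction of token positions on which the feature fires. At 0.019% that is roughly 19 tokens in every 100,000, so this feature is very sparse. The sketch below assumes a firing threshold of zero, and `activation_density` is a hypothetical helper. The page does not specify the threshold or the dataset it used.

```python
import numpy as np


def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature's activation exceeds threshold.

    The threshold of 0.0 is an assumption, not taken from the source page.
    """
    activations = np.asarray(activations, dtype=float)
    return float(np.mean(activations > threshold))


# Toy usage: the feature fires on 2 of 5 positions
print(activation_density([0.0, 0.0, 0.5, 0.0, 1.2]))
```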