INDEX
Explanations
phrases related to honesty and self-reflection
Negative Logits
bay       -0.16
Flem      -0.15
omik      -0.15
stant     -0.15
ukan      -0.14
avia      -0.14
овари     -0.14
Fleming   -0.14
メラ      -0.14
hoe       -0.14
POSITIVE LOGITS
enthal    0.18
ologne    0.16
abler     0.15
andas     0.14
Strauss   0.14
chick     0.14
gov       0.14
umbo      0.14
nist      0.13
generado  0.13
Activations Density 0.348%