Explanations
words related to deception or falsehood
Negative Logits
es          -0.18
eval        -0.18
ylvania     -0.17
lle         -0.16
laus        -0.16
ee          -0.15
little      -0.15
ele         -0.15
i           -0.14
eko         -0.14
Positive Logits
ardy        0.18
овер        0.17
quete       0.17
шив         0.16
lover       0.15
rus         0.15
krat        0.14
coholic     0.14
نامه        0.14
Fall        0.14
Activations Density 0.010%