INDEX
Explanations
statements expressing honesty or frankness
Negative Logits
eynman        -0.65
réfugiés      -0.64
برانيه        -0.64
AndWait       -0.63
fpm           -0.63
aguya         -0.63
:✨            -0.60
порядка       -0.59
льше          -0.59
bénévoles     -0.59
Positive Logits
Frankly            0.96
frankly            0.95
Honestly           0.92
Honestly           0.83
honestly           0.83
disambiguazione    0.71
说实话             0.68
honestly           0.64
admit              0.61
tbh                0.60
Activations Density 0.110%
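
The density figure above is not defined on this page. A minimal sketch of one common reading, assuming it means the share of sampled tokens on which the feature has a nonzero activation, is below. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature fires; the source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 10,000 tokens, 11 of which activate the feature.
sample = [0.0] * 9989 + [1.7] * 11
density = activation_density(sample)
print(f"{density:.3%}")  # 0.110%
```

Under this reading, a density of 0.110% corresponds to the feature firing on roughly 11 of every 10,000 tokens, which fits a sparse feature tied to a narrow set of words such as "Frankly" and "Honestly".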