Explanations
phrases that indicate honesty or truthfulness
Negative Logits
inz         -0.15
illa        -0.15
swick       -0.14
ardin       -0.14
Gallagher   -0.14
ange        -0.14
ÙĦØ·        -0.14
PURE        -0.14
press       -0.14
ìŀĪëĭ¤ëĬĶ   -0.14
Positive Logits
-Sah        0.17
ças         0.15
éal         0.15
aired       0.15
亮          0.14
auses       0.14
atorium     0.14
icone       0.14
odyn        0.14
¢åįķ        0.14
Activations Density: 0.108%