Explanations
phrases related to deceit or untruthfulness
Negative Logits
afort     -0.15
VL        -0.14
Rewards   -0.14
λά        -0.14
zn        -0.14
/modules  -0.14
kız       -0.14
zew       -0.14
zes       -0.14
watchers  -0.14
Positive Logits
osy     0.16
ypi     0.15
ernote  0.15
anza    0.14
:host   0.14
/block  0.14
овар    0.14
iyan    0.14
ač      0.14
adeon   0.14
Activations Density 0.054%