Explanations
references to the concept of deception or lies
Negative Logits
oki       -0.17
stown     -0.15
olian     -0.15
oded      -0.15
idan      -0.14
OME       -0.14
odable    -0.14
iei       -0.14
zyst      -0.14
バー      -0.14
Positive Logits
uten          0.26
utenant       0.23
berman        0.23
chten         0.20
관            0.19
urance        0.16
eg            0.16
gth           0.16
apis          0.15
istrovství    0.15
Activations Density 0.015%