Explanations
instances of deception or pretense
Negative Logits
Token          Logit
/fw            -0.16
_orient        -0.16
comps          -0.15
ãģĦãĤĭ         -0.15
.DAL           -0.15
usercontent    -0.14
ç±             -0.14
oggle          -0.14
atron          -0.14
hang           -0.14
Positive Logits
Token          Logit
usto           0.17
ÅĤu            0.15
McCl           0.15
ugeot          0.15
sten           0.15
annah          0.15
ouri           0.14
Gap            0.14
inston         0.14
Gest           0.14
Activations Density: 0.041%