Explanations
phrases indicating deception or betrayal
Negative Logits
iaux         -0.16
okrat        -0.16
assage       -0.15
arges        -0.15
ocos         -0.15
heiro        -0.14
parer        -0.14
êu           -0.14
lexport      -0.14
ContentPane  -0.14
Positive Logits
giveaway     0.34
betray       0.31
reveal       0.30
clues        0.29
revealing    0.29
giveaways    0.29
Reve         0.29
clue         0.28
reve         0.27
reveals      0.27
Activations Density: 0.129%