INDEX
NEGATIVE LOGITS
Token        Logit
_regs        -0.07
filho        -0.07
_LD          -0.06
snad         -0.06
} ↵ ↵        -0.06
','          -0.06
.same        -0.06
_bank        -0.06
wildfires    -0.06
-framework   -0.06
POSITIVE LOGITS
Token        Logit
deceived     0.10
deception    0.09
deceptive    0.08
deceive      0.08
deceit       0.06
dece         0.06
[sub         0.06
ookie        0.06
psilon       0.06
reasoning    0.06
Activations Density 0.009%