Explanations
references to betrayal and moral conflict
Negative Logits
ksen        -0.16
harass      -0.16
omedical    -0.15
ictim       -0.15
amework     -0.14
bumper      -0.14
arge        -0.14
رÙĤ         -0.14
harassment  -0.14
agle        -0.13
Positive Logits
trait       0.67
betray      0.54
trait       0.54
betrayal    0.53
Bet         0.47
betr        0.47
betrayed    0.46
bet         0.44
tre         0.43
_trait      0.43
Activations Density 0.339%