Explanations
references to moral conflicts and behavioral warnings
Negative Logits
ected    -0.17
oret     -0.15
aster    -0.15
erce     -0.15
atin     -0.14
gr       -0.14
rait     -0.14
etto     -0.14
anto     -0.14
once     -0.14
Positive Logits
_nl          0.15
Hear         0.15
ORIZONTAL    0.15
abd          0.15
PEND         0.14
ipur         0.14
.assertIs    0.14
orb          0.14
tid          0.14
OV           0.14
Activations Density: 0.044%