INDEX
Explanations
references to concepts of evil and immoral behavior
Negative Logits
posable      -0.18
-ing         -0.15
onic         -0.15
oon          -0.15
èį           -0.14
alaxy        -0.14
finity       -0.14
ils          -0.14
oons         -0.14
controlId    -0.14
Positive Logits
ous          0.18
ously        0.18
deeds        0.17
uous         0.16
fully        0.16
lesi         0.16
intent       0.15
TT           0.15
dest         0.15
-do          0.15
Activations Density 0.075%
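
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that density means the fraction of token positions on which the feature's activation is above zero. The dashboard does not state this definition, so treat it as an assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only to reproduce a value of 0.075%.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: the feature fires on 3 of 4000 token positions.
acts = [0.0] * 4000
acts[10] = 1.2
acts[200] = 0.4
acts[3999] = 2.0

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.075%
```

At a density this low, the feature fires on fewer than one token in a thousand, which is typical of a sparse feature.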