INDEX
Explanations
references to moral and ethical concepts
Negative Logits
Token     Logit
el        -0.16
ary       -0.16
antino    -0.15
247       -0.15
eded      -0.15
Vladim    -0.15
edin      -0.15
erson     -0.15
elan      -0.15
umar      -0.14
Positive Logits
Token     Logit
Mor       0.23
Mor       0.23
MOR       0.21
izing     0.21
fiber     0.20
Fiber     0.19
atorium   0.18
mor       0.18
Moral     0.18
compass   0.17
Activations Density 0.007%