INDEX
Explanations
references to violent actions and their consequences
Negative Logits
needle    -0.17
ends      -0.16
ープ      -0.15
_unpack   -0.15
oric      -0.15
Minds     -0.14
.ERR      -0.14
irts      -0.14
anax      -0.14
allon     -0.14
Positive Logits
square    0.25
temple    0.18
below     0.18
BELOW     0.18
grazing   0.17
solar     0.17
across    0.16
hard      0.16
额        0.16
Square    0.16
Activations Density: 0.047%