INDEX
Explanations
phrases with the structure "self-[word]"
phrases related to self-identity or self-awareness
Negative Logits
ulhu      -1.04
"$:/      -0.83
Hutch     -0.74
AX        -0.70
Chains    -0.69
Rouge     -0.69
Basin     -0.68
Shaw      -0.68
Starr     -0.67
ÙIJ       -0.67
Positive Logits
imposed      1.14
proclaimed   1.07
esteem       1.06
talk         1.05
conscious    1.04
destruct     1.04
contained    1.00
decl         0.98
generated    0.98
described    0.96
Activations Density: 0.048%