Explanations
references to self-awareness and self-identity
Negative Logits
Token              Logit
UnusedPrivate      -0.65
Treue              -0.61
XmlAccessorType    -0.61
relâche            -0.61
tenisky            -0.60
setupUi            -0.60
hänen              -0.58
montagnes          -0.57
Tikang             -0.57
"]/                -0.56
Positive Logits
Token       Logit
self        2.02
self        1.93
Self        1.83
Self        1.79
SELF        1.70
SELF        1.67
selves      1.52
selves      1.49
yourself    1.37
Yourself    1.34
Activations Density: 0.237%