Explanations
phrases related to self-referential concepts and self-directed actions
Negative Logits
ãĥ©ãĤ¹        -0.16
uted          -0.14
uen           -0.14
_callbacks    -0.14
ayed          -0.14
orelease      -0.13
ÙĪØ§Ø±        -0.13
otton         -0.13
opa           -0.13
eming         -0.13
Positive Logits
same          0.25
-contained    0.24
ishly         0.23
lessly        0.22
ridge         0.22
änd           0.21
Contained     0.21
explanatory   0.21
contained     0.21
hood          0.19
Activations Density 0.024%
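
Two of the negative-logit tokens (ãĥ©ãĤ¹ and ÙĪØ§Ø±) look like the printable rendering that byte-level BPE vocabularies use for raw UTF-8 bytes. The sketch below assumes the GPT-2-style byte-to-character table; the page does not say which tokenizer the model uses, so this is an assumption, and the helper name decode_token is hypothetical. Under that assumption the two tokens decode to readable Japanese and Arabic fragments.

```python
def bytes_to_unicode():
    """Byte -> printable-character table used by GPT-2-style byte-level BPE.

    Assumption: the dashboard's tokenizer follows this scheme.
    """
    # Bytes that already have a printable Latin-1 character map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, etc.) are shifted to U+0100 and up.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: printable character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Hypothetical helper: map a vocabulary string back to UTF-8 text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Tokens may split a multi-byte character, so replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥ©ãĤ¹"))  # ラス
print(decode_token("ÙĪØ§Ø±"))  # وار
```

Tokens that are already shown in readable form, such as änd, should not be passed through this mapping: under the same assumption their characters would be read as raw bytes that do not form valid UTF-8.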