INDEX
Explanations
phrases that justify actions or beliefs
Negative Logits
lland       -0.16
zin         -0.15
Tube        -0.15
Tube        -0.14
ela         -0.14
Leod        -0.14
igure       -0.14
Gatt        -0.14
éĤ£ç§į      -0.13
esModule    -0.13
Positive Logits
mere                0.18
anga                0.18
doesn               0.15
mere                0.15
åı¸                 0.14
kus                 0.14
ABCDEFGHIJKLMNOP    0.14
Conj                0.14
chal                0.14
shouldn             0.14
Activations Density: 0.077%