INDEX
Explanations
rhetorical questions or expressions of surprise
Negative Logits
gni        -0.16
eniable    -0.15
enty       -0.14
nio        -0.14
anything   -0.14
ulty       -0.14
mie        -0.14
ito        -0.14
emonic     -0.14
久         -0.14
Positive Logits
else       0.21
do         0.20
aya        0.19
did        0.19
better     0.18
timing     0.18
timing     0.17
soever     0.17
ser        0.16
else       0.15
Activations Density 0.083%