Explanations
content related to surprise or unexpected events
Negative Logits
nev     -0.15
IDGE    -0.14
ký      -0.14
Prod    -0.14
uC      -0.14
rens    -0.13
bol     -0.13
ibre    -0.13
idge    -0.13
ìĪ      -0.13
Positive Logits
Categories    0.18
lient         0.18
lash          0.16
رات           0.15
977           0.15
ден           0.14
Categories    0.14
.uni          0.14
άλ            0.14
_tac          0.14
Activations Density: 0.003%