INDEX
Explanations
phrases indicating direction or purpose
Negative Logits
iw       -0.15
Cyril    -0.15
lá       -0.15
zb       -0.15
uhl      -0.14
Bowen    -0.14
íĭĢ      -0.14
yth      -0.14
abay     -0.14
cy       -0.13
Positive Logits
enaire   0.16
orig     0.15
Bark     0.14
Deng     0.14
osi      0.14
avigate  0.14
osp      0.13
osate    0.13
pij      0.13
*pi      0.13
Activations Density: 0.021%