INDEX
Explanations
words related to restrictions or limitations
Negative Logits
usi      -0.18
andra    -0.15
uh       -0.15
vv       -0.15
AWN      -0.14
wick     -0.14
isay     -0.14
rij      -0.14
ans      -0.14
inz      -0.14
Positive Logits
buy      0.19
bu       0.18
bt       0.17
byn      0.16
byt      0.15
byl      0.15
데이트   0.15
bye      0.14
.ali     0.14
ờ        0.14
Activations Density 0.111%
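
For readers unfamiliar with the density figure above, here is a minimal sketch of how such a number could be computed. It assumes density means the fraction of sampled token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature fires at all. The dashboard may define it differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 1 of 900 sampled tokens.
toy_activations = [0.0] * 899 + [2.5]
density = activation_density(toy_activations)

# Format as a percentage with three decimals, matching the dashboard style.
print(f"{density:.3%}")
```

A density this low would indicate a sparse feature that fires on roughly one token in nine hundred.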