INDEX
Explanations
phrases that encourage visiting websites or links
Negative Logits
arena    -0.16
оло      -0.15
olo      -0.15
u        -0.15
zar      -0.14
ir       -0.13
indir    -0.13
ano      -0.13
lim      -0.13
arten    -0.13
Positive Logits
www      0.19
https    0.17
http     0.16
www      0.16
ÑĮ       0.15
rang     0.15
inke     0.15
lamaz    0.14
Ĥæķ°     0.14
lava     0.14
Activations Density: 0.018%
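The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled tokens on which the feature's activation is above zero, might look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only to reproduce a value of 0.018%.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    activation; the source does not confirm this definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 18 of 100,000 sampled tokens.
sample = [1.0] * 18 + [0.0] * 99_982
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.018%
```

Under this reading, a density of 0.018% means the feature fires on roughly 18 of every 100,000 tokens.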