INDEX
Explanations
phrases indicating actions or directives
Negative Logits
urus     -0.17
riad     -0.15
vla      -0.15
COPE     -0.15
wick     -0.15
ussen    -0.15
nid      -0.15
opard    -0.14
å¹       -0.14
ربÙĩ     -0.14
Positive Logits
iams     0.18
ooks     0.17
lain     0.17
imator   0.15
asts     0.15
l        0.15
piler    0.15
.ret     0.14
IJ       0.14
erw      0.14
Activations Density: 0.020%