INDEX
Explanations
phrases focused on evaluating rules, concepts, and distinctions
Negative Logits
ufs       -0.17
wonder    -0.15
fore      -0.14
izr       -0.14
/xhtml    -0.14
stoup     -0.14
žel       -0.14
FS        -0.14
sorte     -0.14
enson     -0.13
Positive Logits
versus    0.26
vs        0.25
-vs       0.19
vice      0.18
Vs        0.17
/how      0.16
_vs       0.16
/non      0.16
/not      0.15
иму       0.14
Activations Density 0.103%
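As a rough illustration of how a density figure like the one above can be read, the sketch below assumes that density means the fraction of token positions on which the feature activates at all. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 3 of 2,000 token positions.
sample = [0.0] * 1997 + [0.4, 1.2, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.103% would correspond to the feature firing on roughly one token in a thousand.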