INDEX
Explanations
references to "direction" or related terms
Negative Logits
ediator    -0.17
endale     -0.16
ничес      -0.15
arend      -0.15
евич       -0.15
edo        -0.15
ardy       -0.15
ê          -0.14
nda        -0.14
erman      -0.14
Positive Logits
ally         0.18
ality        0.17
yes          0.16
-thinking    0.15
atty         0.15
749          0.15
(direction   0.15
direction    0.15
toward       0.15
nings        0.15
Activations Density 0.077%