Explanations
action or movement-related terms
Negative Logits
انية  -0.17
.Invariant  -0.15
sein  -0.15
dfs  -0.15
abus  -0.14
ampions  -0.14
驾  -0.14
ilded  -0.14
onward  -0.14
uchos  -0.14
Positive Logits
towards  0.21
toward  0.21
_DEFINE  0.15
Towards  0.15
away  0.14
past  0.14
Towards  0.14
gli  0.14
564  0.14
step  0.14
Activation density: 0.022%