INDEX
Explanations
actions related to movement or direction
Negative Logits
ont    -0.15
kt     -0.15
t      -0.15
asm    -0.14
Occ    -0.14
Ã¥l    -0.14
abez   -0.14
vas    -0.13
ly     -0.13
aling  -0.13
Positive Logits
chwitz     0.17
orth       0.15
ãģ         0.15
Predictor  0.15
ORTH       0.14
ynn        0.13
phia       0.13
osten      0.13
adc        0.13
inh        0.13
Activations Density: 0.004%