Explanations
phrases expressing abilities or the capacity to take action
Negative Logits
ransition    -0.15
æĭľ          -0.14
Val          -0.14
hte          -0.13
.Library     -0.13
zcze         -0.13
olen         -0.13
clusion      -0.13
ordes        -0.13
hl           -0.13
Positive Logits
afford        0.20
stomach       0.17
lassen        0.17
HANDLE        0.15
possibly      0.15
hazard        0.15
_aff          0.15
handle        0.14
’t            0.14
容            0.14
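Top positive and negative logit lists like the two above are commonly produced by projecting a feature's decoder direction onto the model's unembedding and ranking tokens by the resulting score. The sketch below assumes that method; the source does not state how these particular values were computed, and the 2-dimensional vectors and token scores are hypothetical toy data, not values from this feature.

```python
from typing import Dict, List, Tuple

def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Dot the feature's decoder direction with each token's unembedding vector."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}

def top_logits(effects: Dict[str, float],
               k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most suppressed and the k most promoted tokens."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]        # most negative score first
    positive = ranked[::-1][:k]  # most positive score first
    return negative, positive

# Hypothetical toy model with a 2-dimensional residual stream.
decoder_dir = [1.0, 0.5]
unembed = {
    "afford": [0.2, 0.0],
    "handle": [0.1, 0.1],
    "Val": [-0.1, -0.05],
    "hte": [0.0, -0.2],
}
neg, pos = top_logits(logit_effects(decoder_dir, unembed), k=2)
print([t for t, _ in pos])  # ['afford', 'handle']
print([t for t, _ in neg])  # ['Val', 'hte']
```

Real dashboards may also fold in layer-norm scaling or normalize the decoder row before projecting, so absolute scores differ between tools even when the rankings agree.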
Activation Density 0.148%
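Activation density is usually read as the fraction of sampled tokens on which the feature fires, so 0.148% corresponds to roughly 1.5 tokens per 1,000. The sketch below assumes "fires" means an activation strictly above zero (the `threshold` parameter is an illustrative assumption), and the 100,000-token sample is hypothetical data built to reproduce the figure above.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)

# Hypothetical sample: the feature fires on 148 of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(0, 148 * 500, 500):
    acts[i] = 1.3
density = activation_density(acts)
print(f"{density:.3%}")  # 0.148%
```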