INDEX
Explanations
expressions of preference or desire
Negative Logits
token     logit
all       -0.15
ixon      -0.15
غم        -0.15
they      -0.14
rl        -0.14
ru        -0.14
Anchor    -0.14
lp        -0.14
uche      -0.14
anchor    -0.14
Positive Logits
token       logit
aug         0.18
nothing     0.18
ableObject  0.17
to          0.15
арх         0.15
feedback    0.15
лих         0.15
lessly      0.15
entially    0.14
us          0.14
Activations Density 0.016%