INDEX
Explanations
instances of the word "on" and its variations
Negative Logits
gether      -0.19
lessly      -0.19
ún          -0.19
wicklung    -0.17
ories       -0.16
nze         -0.15
wick        -0.15
jack        -0.15
fighter     -0.14
ophil       -0.14
Positive Logits
us          0.21
coming      0.21
look        0.19
inous       0.19
emin        0.18
yx          0.18
again       0.17
lsa         0.17
rush        0.17
ep          0.16
Activations Density: 0.042%