Explanations
References to the word "ant" and its variations across different contexts.
Negative Logits
rl      -0.17
rig     -0.17
rint    -0.16
ra      -0.16
strup   -0.16
ryo     -0.15
hib     -0.15
ront    -0.15
ract    -0.15
riz     -0.15
Positive Logits
y       0.23
ucket   0.22
yne     0.22
elope   0.20
ing     0.20
enna    0.19
woord   0.19
yre     0.19
werp    0.18
ech     0.17
Activations Density: 0.033%