INDEX
Explanations
frequent use of the word "the."
Negative Logits
Token     Logit
Sense     -0.16
dej       -0.15
arris     -0.15
etto      -0.15
vie       -0.15
herent    -0.14
tein      -0.14
sense     -0.14
bers      -0.14
oen       -0.14
Positive Logits
Token     Logit
orex      0.20
ediator   0.17
Frozen    0.15
缸åIJĮ    0.15
Nazi      0.15
Babe      0.15
kro       0.15
Frozen    0.14
TT        0.14
oci       0.14
Activations Density 0.139%