INDEX
Explanations
phrases related to specific entities or concepts
the definite article "the."
Negative Logits
Token       Logit
Ò           -0.81
thood       -0.77
elaide      -0.74
because     -0.72
leground    -0.72
aba         -0.71
eno         -0.68
!!!!        -0.66
arate       -0.66
rage        -0.65
Positive Logits
Token       Logit
resa        1.07
simplest    1.04
slightest   1.02
biggest     1.02
oret        1.00
latter      0.99
vast        0.98
majority    0.98
entire      0.98
oldest      0.97
Activations Density: 0.256%
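
The page does not say how the density figure is computed. One common reading, assumed here, is the fraction of token positions at which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions where activation > threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 256 of which activate the feature.
acts = [0.0] * 99_744 + [1.5] * 256
density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.256% would mean the feature fires on roughly 1 in 390 tokens, which is consistent with a sparse feature tied to a narrow pattern such as "the" before superlatives and quantifiers.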