INDEX
Explanations
the word "the" when followed by short phrases
instances of the word "the."
Negative Logits
Token      Logit
instead    -0.76
Layer      -0.74
zai        -0.74
.<         -0.71
worn       -0.71
whilst     -0.71
whereas    -0.70
.</        -0.68
rade       -0.68
because    -0.68
Positive Logits
Token            Logit
aforementioned   1.17
latter           1.11
latest           0.94
ses              0.89
slightest        0.88
same             0.88
largest          0.86
greatest         0.85
oret             0.85
entire           0.84
Activations Density: 0.684%