INDEX
Explanations
references to authoritative figures or institutions
occurrences of the word "the."
Negative Logits
thereof       -0.62
thereby       -0.57
respectively  -0.57
thood         -0.57
.             -0.55
iffe          -0.55
wen           -0.55
elaide        -0.54
namely        -0.54
âĢł           -0.54
Positive Logits
simplest   1.08
slightest  1.07
same       1.03
smallest   1.02
oret       0.99
widest     0.95
easiest    0.93
vast       0.93
entirety   0.92
largest    0.92
Activations Density 1.479%