INDEX
Explanations
mentions of academic journals
the definite article "the" in various contexts
Negative Logits
besides      -0.75
atics        -0.71
depended     -0.70
mutants      -0.67
withstand    -0.67
countered    -0.66
ampires      -0.66
beware       -0.65
survives     -0.65
safely       -0.65
Positive Logits
latter          1.02
aforementioned  0.99
latest          0.97
entirety        0.92
publication     0.89
outset          0.89
earliest        0.87
same            0.87
Journal         0.86
slightest       0.81
Activations Density 0.652%