INDEX
Explanations
phrases indicating expectations or normative comparisons
Negative Logits
igham         -0.70
cal           -0.67
OIL           -0.63
Oracle        -0.63
akespeare     -0.63
EntityItem    -0.62
wcsstore      -0.62
dated         -0.61
yles          -0.60
ModLoader     -0.57
Positive Logits
.             0.68
Meet          0.62
sided         0.60
opausal       0.60
arious        0.60
hov           0.59
uble          0.59
meets         0.58
Cause         0.58
anyway        0.58
Activations Density 0.087%