INDEX
Explanations
statements that express safety or certainty in assumptions
Negative Logits
iÅŁ          -0.15
uant         -0.15
andler       -0.15
iÅŁim        -0.15
turnstile    -0.15
ewis         -0.14
Masc         -0.14
ambiguous    -0.14
peats        -0.14
ucz          -0.13
Positive Logits
assumption   0.21
stretch      0.21
safe         0.20
Safe         0.20
expectation  0.19
likelihood   0.19
Stretch      0.18
expecting    0.18
assume       0.17
likely       0.17
Activations Density: 0.097%
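
The page does not define the density figure. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature has a positive activation; the function name `activation_density` and the synthetic data are hypothetical and used for illustration only:

```python
def activation_density(activations):
    """Return the fraction of token positions with a positive activation.

    Assumption: "density" means (positions where the feature fires) / (all
    positions sampled). The source page does not state its definition.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > 0)
    return fired / len(activations)


# Synthetic data, not taken from the page: 97 firing positions in 100,000.
synthetic = [1.0] * 97 + [0.0] * 99_903
density = activation_density(synthetic)
print(f"{density:.3%}")
```

Under this reading, a density of 0.097% corresponds to the feature firing on roughly one token position in a thousand.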