Explanations
contractions with "n't"
negations or terms expressing disagreement
Negative Logits
don           -0.71
rog           -0.69
çļ            -0.68
iers          -0.68
inav          -0.65
èĪ            -0.64
antine        -0.63
cano          -0.62
inen          -0.62
Publications  -0.61
Positive Logits
exactly       1.12
necessarily   1.10
gonna         1.01
quite         0.97
supposed      0.87
really        0.86
epad          0.85
kidding       0.84
icable        0.84
bothering     0.82
Activations Density 0.077%