INDEX
Explanations
negations and contrasts in text
Negative Logits
çļ      -0.81
kamp    -0.76
å¥      -0.75
çīĪ     -0.72
cano    -0.70
USH     -0.68
velt    -0.68
unders  -0.66
ongs    -0.64
ously   -0.64
Positive Logits
necessarily   1.42
icable        1.23
icably        1.11
exactly       1.07
eworthy       1.03
withstanding  0.98
orious        0.97
entirely      0.97
uncommon      0.96
epad          0.95
Activations Density 0.548%