Explanations
statements expressing opinion or commentary
Negative Logits
Token         Logit
arez          -0.71
orse          -0.68
acca          -0.68
ortium        -0.67
ornia         -0.67
bes           -0.65
uala          -0.65
yna           -0.64
undrum        -0.61
Journalists   -0.61
Positive Logits
Token         Logit
nonetheless   1.37
etheless      1.22
nevertheless  1.16
alas          0.86
beware        0.84
darn          0.81
anyways       0.73
damn          0.73
doesnt        0.72
prevailed     0.71
Activations Density 0.354%