INDEX
Explanations
contradictory statements
questions regarding the necessity or motivation behind actions
Negative Logits
fixme      -0.60
WT         -0.59
but        -0.59
Travels    -0.59
osi        -0.58
folios     -0.56
USD        -0.56
But        -0.55
schild     -0.55
orie       -0.54
Positive Logits
nonetheless    1.48
etheless       1.15
nevertheless   1.10
anyway         0.73
darn           0.71
outwe          0.67
awfully        0.65
anyways        0.64
stubborn       0.60
caution        0.59
Activations Density 1.603%