INDEX
Explanations
phrases indicating contrast or negation
references to significant events or facts that are often exaggerated or misrepresented
Negative Logits
WT          -0.73
acca        -0.67
arine       -0.66
igate       -0.66
agos        -0.63
ecast       -0.61
inas        -0.60
Travels     -0.60
Dialogue    -0.58
ukong       -0.58
Positive Logits
nonetheless   1.57
nevertheless  1.35
etheless      1.02
still         0.82
retained      0.75
darn          0.74
strangely     0.73
awfully       0.72
proble        0.71
undeniably    0.70
Activations Density: 1.184%
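The density figure can be read as the share of token positions on which the feature fires. The sketch below shows that arithmetic under stated assumptions: that "density" means the fraction of sampled positions with activation above zero, and that the per-position activations are available as a flat list. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" at a position when its
    activation is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 token positions, 12 of them active.
sample = [0.0] * 988 + [0.5] * 12
density = activation_density(sample)
print(f"{density:.3%}")  # prints "1.200%"
```

A value near 1% on this reading would mean the feature is sparse: it stays silent on the large majority of positions and fires only in a narrow set of contexts.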