Explanations
phrases indicating surprise or strong emphasis
phrases indicating negation or denial
Negative Logits
Token         Logit
RAFT          -0.70
roxy          -0.63
ULTS          -0.61
ousand        -0.60
Posts         -0.60
Comes         -0.59
inese         -0.57
rox           -0.56
perse         -0.56
CENT          -0.56
Positive Logits
Token         Logit
xious         1.23
longer        1.17
ct            0.98
except        0.91
doubt         0.91
matter        0.84
exception     0.84
indication    0.83
otrop         0.82
exaggeration  0.77
Activations Density: 0.044%