INDEX
Explanations
phrases indicating uncertainty or speculation
negative assessments or perceptions of various subjects
Negative Logits
pez            -0.93
instead        -0.78
instead        -0.77
itton          -0.72
ilts           -0.69
nonetheless    -0.68
doubtless      -0.66
éĹĺ            -0.66
undoubtedly    -0.64
Rouge          -0.63
Positive Logits
anymore        1.26
bothered       1.04
remotely       1.00
nor            0.99
anywhere       0.96
anything       0.95
any            0.95
necessarily    0.94
whatsoever     0.94
terribly       0.89
Activations Density: 0.123%