Explanations
overly strong emotional language
instances of the word "loathe" and its variations
Negative Logits
rity          -0.70
rition        -0.68
Buff          -0.66
ramid         -0.65
nesium        -0.65
Bravo         -0.60
Annotations   -0.60
chemistry     -0.59
LESS          -0.59
ITAL          -0.58
Positive Logits
oser           1.00
aves           0.99
omed           0.93
aning          0.90
lder           0.88
aunted         0.83
gged           0.82
ith            0.82
               0.79
aned           0.79
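The Negative Logits and Positive Logits lists appear to rank output tokens by how strongly this feature pushes their logits down or up. One common way feature dashboards derive such lists is to project the feature's decoder vector through the model's unembedding matrix and keep the lowest and highest scores. The sketch below assumes that method, which this page does not confirm. The helper `top_logit_effects`, the toy weights, and the token names `tok_a` to `tok_d` are hypothetical. The dashboard's own scaling of its scores is not stated, so raw scores from this sketch are not directly comparable to the values listed.

```python
def top_logit_effects(direction, unembed, vocab, k=3):
    """Rank vocab tokens by the dot product of a feature direction with each unembedding column.

    Assumption: direction has length d_model, and unembed[i][j] is the weight
    from residual dimension i to vocab token j (shape d_model x vocab_size).
    """
    d_model = len(direction)
    # Raw logit effect of the feature direction on every token.
    scores = [
        sum(direction[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    order = sorted(range(len(vocab)), key=lambda j: scores[j])
    negative = [(vocab[j], scores[j]) for j in order[:k]]             # most suppressed first
    positive = [(vocab[j], scores[j]) for j in reversed(order[-k:])]  # most promoted first
    return negative, positive


# Hypothetical toy model: 2-dimensional residual stream, 4-token vocabulary.
direction = [1.0, -0.5]
unembed = [
    [2.0, 1.0, -1.0, 0.0],
    [0.0, 0.5, 1.0, 0.4],
]
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
neg, pos = top_logit_effects(direction, unembed, vocab, k=2)
print([t for t, _ in neg])  # ['tok_c', 'tok_d']
print([t for t, _ in pos])  # ['tok_a', 'tok_b']
```

Sorting once and slicing both ends keeps the two lists consistent with each other. A real model would use a tensor library's matrix product instead of the nested Python loops.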
Activation Density: 0.023%
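If the activation density figure is the share of tokens on which this feature fires (activation above zero), then 0.023% corresponds to roughly 23 tokens in every 100,000. That definition is an assumption, because the page does not spell it out. The following is a minimal pure-Python sketch using a hypothetical helper `activation_density` and made-up activations:

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`.

    Assumption: "density" means the share of tokens where the feature fires
    (activation > 0); the dashboard does not state its exact definition.
    """
    acts = list(activations)
    if not acts:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in acts if a > threshold)
    return firing / len(acts)


# Hypothetical data: the feature fires on 23 of 100,000 tokens.
acts = [1.5] * 23 + [0.0] * (100_000 - 23)
print(f"{activation_density(acts):.3%}")  # 0.023%
```

A density this low marks a sparse feature. It is consistent with the narrow explanations given for it, although density alone does not confirm them.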