INDEX
Explanations
words related to negative sentiments or feelings, particularly strong dislike or hatred
instances of the term "loathe" or related expressions of strong dislike
Negative Logits
Norn      -0.84
rity      -0.81
*/(       -0.76
hower     -0.76
glass     -0.75
rition    -0.72
sonian    -0.71
lished    -0.69
ITAL      -0.67
manship   -0.67
Positive Logits
oser      1.04
aning     1.02
aned      1.01
veland    1.00
aves      0.95
fty       0.92
zzle      0.92
ppy       0.91
ven       0.90
obb       0.90
Activations Density: 0.009%