INDEX
Explanations
words related to negative feelings such as loathing
instances of the word "loathe" in various forms and contexts
Negative Logits
rition     -0.83
rity       -0.76
pillar     -0.75
manship    -0.72
ITAL       -0.71
lished     -0.71
glass      -0.70
Norn       -0.70
TAIN       -0.68
race       -0.66
Positive Logits
oser       1.09
aves       1.06
veland     1.04
aning      1.03
fty        1.03
vers       0.94
aned       0.94
zzle       0.93
vel        0.91
zz         0.91
Activations Density: 0.015%