Explanations
the word "hate" at various intensities
expressions of hatred
Negative Logits
Token             Logit
ItemImage         -0.84
aqu               -0.82
aunder            -0.81
igmatic           -0.80
istics            -0.76
uggest            -0.75
aver              -0.74
DragonMagazine    -0.74
enture            -0.73
arta              -0.73
Positive Logits
Token             Logit
fully             1.11
hated             0.93
hate              0.87
FUL               0.80
hates             0.79
wasting           0.79
Hate              0.78
76561             0.75
hate              0.75
ful               0.75
Activations Density: 0.021%