Explanations
negative attributes or qualities
expressions of negativity or criticism
Negative Logits
ĸļ            -0.80
arov          -0.76
ynthesis      -0.73
ovember       -0.73
hens          -0.73
ktop          -0.72
illation      -0.71
ellation      -0.71
agos          -0.70
Revolution    -0.69
Positive Logits
dest          0.98
karma         0.88
enough        0.80
bye           0.80
dies          0.78
vib           0.77
Samar         0.77
enough        0.76
undermin      0.75
die           0.74
Activations Density 0.020%