INDEX
Explanations
strong negative emotions or hostility, particularly related to hatred
references to hatred and its various expressions and implications
NEGATIVE LOGITS
UNCH       -0.83
helicop    -0.82
ODE        -0.80
umm        -0.73
USE        -0.69
å¸         -0.67
aqu        -0.66
glas       -0.64
change     -0.61
AMA        -0.61
POSITIVE LOGITS
hatred     0.95
towards    0.90
prejudice  0.83
yip        0.83
toward     0.82
vengeance  0.82
ãĥĨ        0.78
lessly     0.78
wart       0.77
rage       0.74
Activations Density 0.029%