Explanations
words related to controversial or negative information, particularly regarding political or societal issues
Negative Logits
arts        -0.80
ulton       -0.79
wisely      -0.73
Adviser     -0.72
ĺħ          -0.72
ĸļ          -0.71
agine       -0.69
gerald      -0.68
aido        -0.68
nan         -0.68
Positive Logits
hostility        0.86
disregard        0.81
racism           0.78
malice           0.75
refusal          0.74
contradiction    0.73
denial           0.72
rejection        0.72
ban              0.71
sexism           0.71
Activations Density: 0.022%