INDEX
Explanations
terms associated with self-righteousness and superiority in moral contexts
Negative Logits
Lag      -0.17
lore     -0.16
anim     -0.15
lea      -0.14
ussen    -0.14
ay       -0.14
chat     -0.14
ta       -0.13
aan      -0.13
"        -0.13
Positive Logits
ismu     0.16
alf      0.16
ToOne    0.16
ism      0.16
aggio    0.15
Vul      0.15
ulance   0.15
Foto     0.15
edn      0.15
+xml     0.14
Activations Density: 0.116%