INDEX
Explanations
language indicating moral outrage and condemnation of unethical behavior
Negative Logits
gii       -0.15
oplayer   -0.14
ubic      -0.14
Erot      -0.14
ære       -0.14
665       -0.14
verage    -0.14
hass      -0.13
muschi    -0.13
éric      -0.13
Positive Logits
hide      0.38
hor       0.33
rep       0.32
des       0.31
sick      0.31
hor       0.29
hide      0.28
gh        0.28
repell    0.26
he        0.26
Activations Density 0.378%