Explanations
Words related to negative behavior or treatment
terms related to cruelty and inhumanity
Negative Logits
Token      Logit
TD         -0.78
Met        -0.75
Cit        -0.75
Amar       -0.74
MET        -0.72
ERT        -0.70
Kislyak    -0.70
mint       -0.70
Jarrett    -0.69
MP         -0.69
Positive Logits
Token          Logit
cruel          3.12
cruelty        3.01
Cruel          2.55
humane         2.12
humane         2.01
inhuman        1.76
barbaric       1.68
cru            1.61
compassionate  1.54
brutality      1.47
Activations Density: 0.029%