INDEX
Explanations
references to behavioral change and social conduct
Negative Logits
599      -0.16
elow     -0.14
inding   -0.13
insult   -0.13
ά        -0.13
maj      -0.13
ibil     -0.13
ेदन      -0.13
ाहत      -0.13
üç       -0.13
Positive Logits
behavior     0.88
behaviour    0.81
Behavior     0.73
behaviors    0.73
behavior     0.68
行为         0.67
behaviours   0.64
conduct      0.63
Behaviour    0.63
Behavior     0.60
Activations Density 0.451%