Explanations
discussions about hypocrisy and moral inconsistencies in behavior and beliefs
Negative Logits
lob        -0.16
Mechan     -0.16
enef       -0.14
ービ       -0.14
wit        -0.14
rost       -0.14
toolbox    -0.14
mercy      -0.14
liž        -0.13
eka        -0.13
Positive Logits
behavior   0.42
conduct    0.40
行为       0.38
actions    0.38
behaviour  0.36
behaviors  0.36
Behavior   0.35
behavior   0.34
Behavior   0.31
повед      0.31
Activations Density 0.301%