INDEX
Explanations
instances of betrayal and moral dilemmas
Negative Logits
862            -0.16
kov            -0.15
-lite          -0.15
imiter         -0.15
lder           -0.14
ambre          -0.14
entitlement    -0.14
iali           -0.14
rastructure    -0.14
Wunused        -0.14
Positive Logits
passion        0.17
pet            0.17
passions       0.17
il             0.17
Formal         0.16
pinch          0.16
Juda           0.15
cab            0.15
unnatural      0.15
import         0.15
Activations Density 0.406%