INDEX
Explanations
concepts related to morality and respect
Head Attr Weights
0:0.02
1:0.02
2:0.07
3:0.05
4:0.05
5:0.03
6:0.05
7:0.46
8:0.04
9:0.04
10:0.08
11:0.05
Negative Logits
################    -1.71
875                 -1.62
crashes             -1.61
Torn                -1.61
342                 -1.60
heartbreaking       -1.54
eps                 -1.52
ł                   -1.51
anz                 -1.51
[&                  -1.47
Positive Logits
sophistication      2.48
professionalism     2.28
antry               1.93
anonymity           1.92
advancement         1.90
ACY                 1.84
superiority         1.84
awareness           1.82
ainment             1.80
Appearance          1.79
Activations Density 0.001%