INDEX
Explanations
concepts related to morality and ethical considerations
Negative Logits
Ws    -0.20
WS    -0.19
WB    -0.19
WF    -0.19
W     -0.19
WL    -0.19
WM    -0.19
(W    -0.18
WP    -0.18
/W    -0.18
Positive Logits
width       0.45
wide        0.42
;width      0.41
wealth      0.40
weakness    0.39
weak        0.39
widths      0.39
weekly      0.38
worst       0.38
widening    0.37
Activations Density: 0.252%