INDEX
Explanations
concepts related to morality and decision-making
Negative Logits
Token        Logit
76561        -0.77
thous        -0.68
hurd         -0.64
anwhile      -0.63
ikarp        -0.59
nces         -0.59
ãĥ¼ãĥĨãĤ£    -0.58
culosis      -0.55
vice         -0.55
mud          -0.55
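The token ãĥ¼ãĥĨãĤ£ in the negative-logits list is probably not corrupted text. It looks like the printable form that byte-level BPE vocabularies give to raw UTF-8 bytes. The sketch below assumes the feature comes from a model with a GPT-2 style byte-level tokenizer, which this page does not state. It rebuilds the byte-to-character table that such tokenizers use and inverts it. The helper names `gpt2_byte_decoder` and `decode_token` are hypothetical and belong to no library.

```python
def gpt2_byte_decoder():
    """Map each printable stand-in character back to its raw byte value.

    Assumption: GPT-2 style byte-level BPE, where printable bytes map to
    themselves and the remaining bytes are shifted to code points 256+.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {chr(c): b for b, c in zip(byte_values, code_points)}


def decode_token(token):
    """Turn a vocabulary-form token string into readable text."""
    decoder = gpt2_byte_decoder()
    raw = bytes(decoder[ch] for ch in token)
    # Tokens may end mid-character, so replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_token("ãĥ¼ãĥĨãĤ£"))
```

Under that assumption the token decodes to a run of Japanese katakana. Plain ASCII tokens such as `thous` or `mud` pass through unchanged.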
Positive Logits
Token          Logit
alike          1.22
depending      1.17
depending      1.06
respectively   0.95
modes          0.71
eras           0.71
;              0.69
.              0.66
BW             0.66
dich           0.62
Activation Density: 0.342%