INDEX
Explanations
statements that reflect basic standards of human decency and moral judgments
Negative Logits
ALAR          -0.16
ani           -0.15
SYS           -0.15
Cla           -0.15
aben          -0.15
ÙĨÚ¯          -0.14
u             -0.14
prom          -0.14
t             -0.14
Systems       -0.14
Positive Logits
egin          0.16
Fang          0.16
basic         0.15
Spoiler       0.15
iners         0.15
onec          0.14
IDENT         0.14
é¡Į           0.14
DisplayStyle  0.14
ków           0.14
Activations Density 0.201%