INDEX
Explanations
instances of self-contradiction and arguments about morality and values
Negative Logits
Token      Logit
erp        -0.16
Morales    -0.16
ermen      -0.16
sov        -0.16
dle        -0.15
IRT        -0.15
expend     -0.14
еÑĢп       -0.14
ç´         -0.14
loe        -0.14
Positive Logits
Token      Logit
gfx        0.15
opia       0.14
ano        0.14
clip       0.14
Alic       0.14
Pl         0.14
Escape     0.14
Tu         0.14
Escape     0.14
amel       0.14
Activations Density: 0.422%