INDEX
Explanations
references to self-awareness and personal agency
Negative Logits
ana            -0.15
aver           -0.15
.invalidate    -0.15
Sil            -0.15
Moral          -0.14
ãĤ·ãĥ¼         -0.14
Hoff           -0.14
asury          -0.14
enders         -0.14
612            -0.14
Positive Logits
页éĿ¢åŃĺæ¡£å¤ĩ份    0.19
inel           0.15
ÌĨ             0.14
wij            0.14
ảo             0.14
iffies         0.14
uger           0.14
untime         0.14
ëį°            0.14
tục            0.14
Activations Density 0.838%