INDEX
Explanations
affirmations and expressions of self-identity
Negative Logits
ulton    -0.18
heck     -0.16
име      -0.16
oled     -0.15
hea      -0.15
ij¸      -0.15
IRON     -0.15
eya      -0.15
ÏģÏī     -0.14
ylon     -0.14
Positive Logits
now      0.16
eld      0.14
merely   0.14
atr      0.14
anda     0.14
thanks   0.14
uco      0.14
mere     0.14
fare     0.14
eig      0.14
Activations Density 0.002%