Explanations
words related to emotional or physical suffering
Negative Logits
tual      -0.15
lesi      -0.15
ekli      -0.14
.ca       -0.14
доÑĢож    -0.13
TC        -0.13
ìļ°ë¦¬    -0.13
å¯        -0.13
ìĨĶ       -0.13
ereum     -0.13
Positive Logits
umbo      0.16
394       0.15
нÑıÑĤ      0.15
396       0.15
ILD       0.15
IBE       0.15
pir       0.14
idor      0.14
avax      0.13
IDS       0.13
Activations Density 0.006%