INDEX
Explanations
references to specific brands or products, especially in a negative context
Negative Logits
шт      -0.18
ctors   -0.15
ERO     -0.15
стит    -0.15
noteq   -0.15
cloak   -0.15
стра    -0.15
sher    -0.15
amat    -0.14
ξι      -0.14
Positive Logits
Hy       0.18
pol      0.17
Pol      0.17
Pol      0.16
hy       0.15
hydr     0.15
-pol     0.15
Desired  0.14
iesel    0.14
¬        0.14
Activations Density 0.025%