INDEX
Explanations
negations or negative phrases in the text
Negative Logits
latter    -0.18
iaux      -0.16
eless     -0.16
n         -0.15
z         -0.15
-sided    -0.15
h         -0.15
d         -0.14
Ñıб       -0.14
874       -0.14
Positive Logits
ħn         0.15
/-         0.15
ÑįÑĤомÑĥ   0.15
Bris       0.14
rador      0.14
rası       0.13
ADOR       0.13
prav       0.13
ador       0.13
ÏĥÏħ       0.13
Activations Density 0.067%