Explanations
words or phrases related to negation or absence
Negative Logits
Token      Logit
iant       -0.16
dale       -0.15
nen        -0.15
onest      -0.15
εί         -0.15
eline      -0.14
ernetes    -0.14
Macros     -0.14
reet       -0.14
esto       -0.14
Positive Logits
Token      Logit
olian      0.24
ither      0.22
lect       0.21
aten       0.20
vents      0.19
xp         0.19
asier      0.19
ager       0.19
uron       0.19
vidence    0.18
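The two logit lists above each show ten tokens, ordered from the strongest effect to the weakest. As a minimal sketch of how such lists could be picked out of a per-token table, the snippet below sorts a small dictionary and takes the extremes from each end. The name `top_logits` and the dictionary `logit_effects` are hypothetical, introduced only for illustration; the values are copied from the lists above, and nothing here is claimed to be how the dashboard itself computes them.

```python
def top_logits(logit_effects, k=10):
    """Return (most_negative, most_positive), each a list of k (token, value) pairs.

    `logit_effects` is a hypothetical mapping from token string to logit value.
    """
    # Sort ascending by value so the most negative entries come first.
    ranked = sorted(logit_effects.items(), key=lambda item: item[1])
    most_negative = ranked[:k]
    # Take the last k entries and reverse them so the largest value leads.
    most_positive = ranked[-k:][::-1]
    return most_negative, most_positive


# A few entries copied from the lists above (not the full vocabulary).
logit_effects = {
    "iant": -0.16,
    "dale": -0.15,
    "olian": 0.24,
    "ither": 0.22,
    "lect": 0.21,
}

negative, positive = top_logits(logit_effects, k=2)
```

With `k=2`, `negative` starts with `iant` and `positive` starts with `olian`, matching the order of the lists above.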
Activations Density: 0.051%
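One way to read the density figure, assuming it means the share of sampled tokens on which the feature is active at all, is sketched below. The function `activation_density`, the zero threshold, and the synthetic sample are assumptions made for illustration; the page itself does not state how the figure is computed.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of activation values strictly above `threshold`.

    Assumes `activations` holds one value per sampled token (hypothetical input).
    """
    if not activations:
        return 0.0
    fired = sum(1 for value in activations if value > threshold)
    return 100.0 * fired / len(activations)


# Synthetic sample: the feature fires on 51 of 100,000 tokens.
sample = [0.0] * 99_949 + [1.0] * 51
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under that reading, a density of 0.051% corresponds to roughly one active token in every two thousand.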