INDEX
Explanations
words and prefixes related to the concept of "un" or negation
Negative Logits
Token       Logit
ade         -0.20
Entr        -0.15
ses         -0.15
образ       -0.15
ebek        -0.15
whel        -0.14
_PAD        -0.14
antis       -0.14
efficient   -0.14
favor       -0.14
Positive Logits
Token       Logit
wav         0.22
wa          0.22
erring      0.22
wav         0.21
w           0.21
waiver      0.21
ending      0.20
uestion     0.19
fal         0.19
equiv       0.18
Activations Density 0.030%