INDEX
Explanations
Instances of the prefix "un" that convey negation or undesirability
Negative Logits
wards          -0.15
BN             -0.15
hq             -0.15
eval           -0.15
usercontent    -0.15
oir            -0.14
erior          -0.14
sle            -0.14
iske           -0.14
xb             -0.14
Positive Logits
lesi           0.18
elcome         0.18
Colomb         0.17
desirable      0.16
è              0.16
old            0.16
reck           0.15
ounded         0.15
inky           0.15
inkle          0.15
Activations Density: 0.005%