Explanations
phrases that express negation or dismissal, particularly using the term "never."
Negative Logits
Token        Logit
aises        -0.14
ught         -0.14
enticator    -0.14
sj           -0.13
arty         -0.13
umph         -0.13
óÅĤ          -0.13
avatars      -0.13
istical      -0.13
ROP          -0.13
Positive Logits
Token            Logit
mind             0.30
mind             0.27
winter           0.24
underestimate    0.23
ending           0.23
land             0.22
Mind             0.21
Ending           0.21
trust            0.21
endum            0.20
Activations Density: 0.020%