Explanations
phrases indicating addition or inclusion
Negative Logits
Token   Logit
yah     -0.65
NING    -0.64
Zel     -0.60
ARE     -0.60
zing    -0.59
cdn     -0.58
THR     -0.58
grim    -0.57
mare    -0.57
rior    -0.56
Positive Logits
Token       Logit
endum       1.32
itional     1.18
ictions     1.14
ition       1.10
itions      1.09
ressing     1.09
itionally   1.07
resses      1.03
ictive      1.03
insult      1.02
Activations Density: 0.543%
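
As a rough illustration of what a density figure like this may represent, the sketch below assumes that activation density means the fraction of token positions on which the feature fires (activation above zero). That definition, the function name `activation_density`, and the sample values are all assumptions for illustration; none of them come from the page itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition: a position counts as "active" when its activation
    is strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1,000 token positions, of which 5 have a nonzero activation.
sample = [0.0] * 995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.500%
```

Under this reading, a density of 0.543% would mean the feature is active on roughly 5 or 6 of every 1,000 tokens in the sampled text.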