INDEX
Explanations
phrases that denote inclusion or offering various options
Negative Logits
mt          -0.17
adol        -0.16
illes       -0.15
ove         -0.15
ubb         -0.15
spanking    -0.14
entic       -0.14
promotion   -0.14
prom        -0.14
aste        -0.14
Positive Logits
eltas             0.17
Bylo              0.16
åį                0.16
uran              0.15
Porno             0.15
åľĴ               0.15
ÅĤaw              0.15
gün               0.14
elocity           0.14
ReuseIdentifier   0.14
Activations Density: 0.226%