INDEX
Explanations
phrases indicating high quality or superiority
Negative Logits
ment       -0.17
nt         -0.16
наче       -0.15
our        -0.15
ema        -0.14
zon        -0.14
abler      -0.14
anine      -0.14
/do        -0.14
(es        -0.14
Positive Logits
-notch     0.22
most       0.19
OLON       0.18
-rated     0.17
oley       0.17
thora      0.16
cott       0.16
ogr        0.16
-secret    0.16
pest       0.15
Activations Density: 0.061%