INDEX
Explanations
phrases indicating quantity or frequency
NEGATIVE LOGITS
iors        -0.18
combe       -0.15
ceased      -0.15
illes       -0.15
oders       -0.14
confines    -0.14
orne        -0.14
hy          -0.14
ble         -0.14
ialis       -0.14
POSITIVE LOGITS
tery        0.26
to          0.21
nict        0.20
ting        0.19
fewer       0.19
tern        0.19
ãĤĵãģ©      0.18
tering      0.17
more        0.17
TA          0.16
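
The entry "ãĤĵãģ©" in the positive logits list looks like a byte-level BPE token shown in its raw vocabulary form rather than as readable text. The page does not say which tokenizer is used, so the following is only a sketch under the assumption that the vocabulary follows the GPT-2 style byte-to-unicode mapping; the helper names bytes_to_unicode and decode_token are illustrative and not taken from this page. Under that assumption the entry decodes to the Japanese characters "んど".

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at or above 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


def decode_token(token: str) -> str:
    """Map each vocabulary character back to its byte, then decode as UTF-8."""
    inverse = {ch: b for b, ch in bytes_to_unicode().items()}
    raw = bytes(inverse[ch] for ch in token)
    # A single token may hold an incomplete UTF-8 sequence; replace such bytes.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_token("ãĤĵãģ©"))
```

Plain ASCII entries such as "fewer" and "more" pass through unchanged, so only the non-ASCII entry is affected by this decoding.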
Activations Density 0.032%