Explanations
references to moderation or moderate concepts in various contexts
Negative Logits
eger       -0.16
nat        -0.16
ings       -0.16
forge      -0.16
fully      -0.15
roperty    -0.15
ÑĤÑĶ       -0.14
ajan       -0.14
istics     -0.14
udi        -0.14
Positive Logits
(Mod          0.20
/mod          0.19
(mod          0.18
.MOD          0.18
eterminate    0.17
amente        0.17
ded           0.17
.mods         0.17
éc            0.16
igli          0.16
Activations Density 0.030%
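
For readers unfamiliar with the density figure, the sketch below shows one common way such a number could be computed: the fraction of token positions at which the feature's activation is above a threshold. This definition is an assumption, since the page does not state how its density is calculated. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    feature fires; the source page does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 3 of 10,000 token positions.
sample = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.030%"
```

Under this reading, a density of 0.030% would mean the feature is active on roughly 3 in every 10,000 tokens.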