Explanations
references to "ponies" and related terms
Negative logits
ODE        -0.15
gne        -0.15
llib       -0.15
oq         -0.15
iales      -0.15
арод       -0.14
erç        -0.14
esk        -0.14
ulses      -0.14
eks        -0.14

Positive logits
pon         0.27
Pon         0.25
pon         0.24
pons        0.19
emon        0.18
entially    0.16
yp          0.16
ymous       0.16
poz         0.16
pok         0.15
Activation density: 0.007%