INDEX
Explanations
phrases that describe mechanisms or methods
Negative Logits
avin     -0.15
boy      -0.15
hit      -0.14
gram     -0.14
shelf    -0.14
oard     -0.14
jug      -0.13
than     -0.13
ught     -0.13
SOM      -0.13
Positive Logits
ioned       0.16
ród         0.16
ifu         0.16
angs        0.15
serrat      0.15
ufe         0.15
ÙĨÙĪÙģ      0.14
illas       0.14
ums         0.14
valuator    0.14
Activations Density: 0.023%