INDEX
Explanations
the word "only" in various contexts
Negative Logits
aly             -0.16
spÄĽ            -0.15
اÙĦÙī           -0.15
æĥij            -0.15
ERM             -0.15
_categorical    -0.15
arges           -0.14
ább             -0.14
ophon           -0.14
arget           -0.14
Positive Logits
thing           0.22
rarely          0.20
few             0.20
recently        0.18
when            0.18
after           0.17
fools           0.16
Thing           0.16
problem         0.16
when            0.16
Activations Density: 0.023%