Explanations
questions about motivation or cause
Negative Logits
alles           1.01
ANYTHING        0.95
anything        0.94
גם              0.88
也              0.85
Anything        0.84
YNAM            0.84
logros          0.84
tudo            0.82
extravaganza    0.82
Positive Logits
these           0.85
use             0.84
These           0.78
použití         0.73
these           0.72
হল              0.72
让              0.71
θη              0.70
вих             0.69
ili             0.69
Activations Density 0.128%