Explanations
selecting alternatives and preferences
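Negative Logits
importantly  0.65
重要的是  0.64  ("importantly")
важно  0.63  ("important")
belangrijk  0.63  ("important")
wichtig  0.58  ("important")
중요  0.57  ("important")
viktigt  0.56  ("important")
penting  0.55  ("important")
IMPORTANT  0.55
ważne  0.55  ("important")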
Positive Logits
Prefer  0.73
prefer  0.72
尽量  0.66  ("as far as possible")
代わりに  0.66  ("instead")
Prefer  0.65
Instead  0.64
предпочита  0.64  ("prefer")
prefer  0.63
try  0.63
lieber  0.62  ("rather")
Activations Density: 0.006%