INDEX
Explanations
expressions of decision-making and confidence
Negative Logits
aji          -0.07
linger       -0.07
agner        -0.06
993          -0.06
ẫ            -0.06
enn          -0.06
STRICT       -0.06
wig          -0.06
WHATSOEVER   -0.06
WithURL      -0.06
POSITIVE LOGITS
correct         0.09
correctness     0.09
decisions       0.08
OK              0.07
direction       0.07
Correct         0.07
correct         0.07
justification   0.07
choices         0.07
ok              0.06
Activations Density 0.036%