Explanations
achieve goals or discover ways
Negative Logits
hold      0.88
hold      0.75
Hold      0.68
Hold      0.66
HOLD      0.61
rid       0.53
HOLD      0.51
持        0.49
halten    0.46
HOL       0.45
Positive Logits
rozpozn       0.45
odkry         0.43
discovered    0.41
Discovery     0.41
découvert     0.40
Discover      0.38
discovered    0.38
desco         0.38
Cannot        0.38
scoperta      0.38
Activations Density 0.006%