INDEX
Explanations
actions that are performed exceptionally or with high success
Negative Logits
cki         -0.18
út          -0.15
">//        -0.14
halt        -0.14
stairs      -0.14
adecimal    -0.14
ufe         -0.14
htable      -0.14
ovsky       -0.14
Fa          -0.14
Positive Logits
fox         0.27
pace        0.26
smart       0.26
mus         0.25
gun         0.24
bid         0.24
score       0.23
strip       0.22
distance    0.22
match       0.22
Activations Density: 0.010%