Explanations
phrases about decision-making and agency
Negative Logits
alus        -0.13
ãģ«ãĤĤ      -0.13
Sole        -0.13
unb         -0.13
iero        -0.13
ebe         -0.13
itudes      -0.12
iances      -0.12
justified   -0.12
thinkable   -0.12
Positive Logits
leave       0.51
Leave       0.47
Leave       0.45
leaving     0.43
leave       0.43
let         0.42
letting     0.39
leaves      0.37
wait        0.34
LET         0.33
Activations Density 0.359%