INDEX
Explanations
physical actions and their consequences
Head Attr Weights
Head  Weight
0     0.02
1     0.02
2     0.11
3     0.07
4     0.14
5     0.03
6     0.03
7     0.34
8     0.04
9     0.04
10    0.07
11    0.05
Negative Logits
sonian      -2.38
priority    -1.76
eworks      -1.71
uilding     -1.68
cellent     -1.61
CLUS        -1.61
rative      -1.60
exclusive   -1.60
endars      -1.59
uments      -1.57
Positive Logits
recoil      1.74
shy         1.59
overhe      1.55
overe       1.54
sudden      1.47
whiff       1.45
losing      1.40
Butt        1.39
footsteps   1.38
overly      1.37
Activations Density 0.001%