INDEX
Explanations
words related to actions and verb forms
Head Attr Weights
0:0.02
1:0.02
2:0.35
3:0.03
4:0.09
5:0.04
6:0.02
7:0.07
8:0.03
9:0.06
10:0.14
11:0.07
Negative Logits
imar      -1.44
arat      -1.28
Wyn       -1.20
Galile    -1.13
anqu      -1.08
ndra      -1.07
Breat     -1.03
Dough     -1.02
othy      -1.01
Hear      -1.01
Positive Logits
cli        1.24
️          1.11
operation  1.11
SHIP       1.08
owder      1.02
ALLY       1.01
hedral     1.01
aign       1.01
(*         1.00
olicy      0.98
Activations Density: 0.155%