Explanations
phrases or questions that express expectations or propose actions
Head Attr Weights
0:0.05
1:0.03
2:0.12
3:0.26
4:0.07
5:0.06
6:0.02
7:0.13
8:0.07
9:0.02
10:0.07
11:0.04
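As a minimal sketch, the head weights listed above can be parsed and ranked to see which heads dominate. It assumes each line is a `head:weight` pair for this feature; the names `RAW`, `parse_head_weights`, and `top_heads` are hypothetical helpers, not part of any dashboard API.

```python
# Head weights copied verbatim from the list above, one "head:weight" pair per line.
RAW = """0:0.05
1:0.03
2:0.12
3:0.26
4:0.07
5:0.06
6:0.02
7:0.13
8:0.07
9:0.02
10:0.07
11:0.04"""


def parse_head_weights(raw):
    """Turn 'head:weight' lines into a {head_index: weight} dict."""
    weights = {}
    for line in raw.splitlines():
        head, value = line.split(":")
        weights[int(head)] = float(value)
    return weights


def top_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


weights = parse_head_weights(RAW)
print(top_heads(weights))  # [(3, 0.26), (7, 0.13), (2, 0.12)]
```

Under that reading, head 3 carries roughly twice the weight of any other head, and the twelve displayed values sum to about 0.94, presumably because each value is rounded to two decimals.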
Negative Logits
ォ: -2.68
ée: -2.51
poral: -2.50
ocular: -2.45
ixt: -2.40
Illum: -2.29
スト: -2.29
��: -2.28
ixture: -2.27
ゴン: -2.26
Positive Logits
idiots: 3.51
devs: 3.42
admins: 3.29
doesnt: 3.22
blackmail: 3.19
downgrade: 3.07
crap: 3.06
incompetence: 3.03
incentives: 2.92
incentiv: 2.90
Activations Density 1.351%