INDEX
Explanations
phrases indicating effort or difficulty
Head Attr Weights
0:0.09
1:0.01
2:0.27
3:0.09
4:0.12
5:0.08
6:0.02
7:0.05
8:0.05
9:0.05
10:0.08
11:0.04
Negative Logits
Token      Logit
uart       -1.29
eters      -1.28
olor       -1.23
lique      -1.22
multi      -1.22
represent  -1.22
ysis       -1.21
ulet       -1.18
lin        -1.14
pur        -1.12
Positive Logits
Token      Logit
方         1.36
Mellon     1.35
ifiable    1.22
hetto      1.15
luck       1.11
speeding   1.07
ォ         1.07
Fenrir     1.05
darn       1.03
Huck       1.03
Activations Density: 0.082%