INDEX
Explanations
words associated with gradients and measurements of performance or risk
Head Attr Weights
0:0.07
1:0.14
2:0.04
3:0.05
4:0.04
5:0.27
6:0.05
7:0.03
8:0.05
9:0.10
10:0.07
11:0.04
Negative Logits
nomination     -1.44
Born           -1.36
allegiance     -1.34
affiliation    -1.33
UD             -1.32
appearance     -1.31
Preferred      -1.31
endors         -1.30
reve           -1.29
ndra           -1.29
Positive Logits
WARE           1.64
ipel           1.60
ゴ             1.59
Balt           1.56
PLIED          1.49
istg           1.44
Grimoire       1.40
pandemonium    1.37
ograp          1.37
sqor           1.36
Activations Density 0.014%