INDEX
Explanations
expressions of knowledge and awareness
Head Attr Weights
0:0.12
1:0.12
2:0.03
3:0.04
4:0.04
5:0.22
6:0.03
7:0.03
8:0.14
9:0.07
10:0.06
11:0.05
Negative Logits
termination    -1.55
consolidation  -1.54
merging        -1.53
remaining      -1.51
terminating    -1.46
merger         -1.40
rollout        -1.40
failure        -1.39
collapsing     -1.39
continuation   -1.37
Positive Logits
ESSION         1.48
idian          1.47
TextColor      1.45
��             1.44
OOD            1.43
hour           1.42
hours          1.38
Professor      1.36
mosp           1.36
GoldMagikarp   1.34
Activations Density: 0.014%