INDEX
Explanations
phrases related to perception and experience
Head Attr Weights
0:0.01
1:0.05
2:0.14
3:0.08
4:0.02
5:0.10
6:0.12
7:0.11
8:0.10
9:0.05
10:0.08
11:0.08
Negative Logits
ivating     -1.12
ivated      -1.06
ascus       -1.05
ription     -1.04
sidx        -1.03
added       -0.99
leaflets    -0.99
Loading     -0.98
aucus       -0.97
asio        -0.96
Positive Logits
prope       0.99
ulhu        0.97
fireball    0.96
Deus        0.96
',"         0.96
}.          0.95
…           0.94
!/          0.93
$$          0.93
Doll        0.91
Activations Density 0.113%