INDEX
Explanations
phrases related to reasoning or conclusions
Head Attr Weights
0:0.05
1:0.03
2:0.15
3:0.06
4:0.27
5:0.04
6:0.03
7:0.03
8:0.13
9:0.09
10:0.06
11:0.02
Negative Logits
lobb:-1.54
nesty:-1.33
vertisement:-1.32
lobby:-1.29
zilla:-1.28
breat:-1.26
aturdays:-1.24
ilyn:-1.24
breathe:-1.24
imposed:-1.23
Positive Logits
��:1.54
��:1.44
��:1.43
�:1.40
CHAT:1.35
��:1.34
NEC:1.28
龍契士:1.26
��:1.26
Virtue:1.25
Activations Density 0.006%