INDEX
Explanations
phrases indicating negation or absence
Head Attr Weights
0:0.03
1:0.07
2:0.13
3:0.08
4:0.01
5:0.04
6:0.10
7:0.06
8:0.06
9:0.11
10:0.13
11:0.14
Negative Logits
bats:-1.20
Researchers:-1.01
roup:-0.99
fuck:-0.97
�:-0.96
swung:-0.96
antis:-0.95
ventures:-0.95
Sl:-0.94
Talks:-0.93
Positive Logits
justification:1.08
evidence:1.07
ayne:1.06
ambiguity:1.03
tampering:1.00
tera:1.00
escaping:0.98
satisfactory:0.98
��:0.98
inconsistency:0.96
Activations Density 0.043%