INDEX
Explanations
phrases related to contradictions or negations in context
Head Attr Weights
0:0.03
1:0.02
2:0.41
3:0.06
4:0.06
5:0.03
6:0.13
7:0.02
8:0.04
9:0.03
10:0.05
11:0.05
Negative Logits
Gael            -1.99
iHUD            -1.73
itus            -1.57
Kills           -1.53
ogl             -1.51
qualification   -1.50
Rain            -1.49
ビ              -1.49
proportions     -1.43
Scope           -1.43
Positive Logits
themselves      2.32
selves          2.21
selves          1.98
ently           1.75
THEIR           1.74
undai           1.70
okers           1.70
itimate         1.66
tten            1.63
abundantly      1.62
Activations Density: 0.055%