INDEX
Explanations
phrases indicating contradictions, complexities, and societal frustrations
Head Attr Weights
0:0.02
1:0.04
2:0.12
3:0.04
4:0.01
5:0.03
6:0.13
7:0.09
8:0.07
9:0.06
10:0.09
11:0.26
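
As a minimal sketch, the weights listed above can be ranked to see which heads contribute most. It assumes each entry is a `head index:weight` pair for a 12-head layer; the page does not say how the weights are normalized, and the names `head_weights` and `top_heads` are hypothetical.

```python
# Head attribution weights copied from the listing above (head index -> weight).
# Assumption: each entry is "head:weight" for one attention head.
head_weights = {
    0: 0.02, 1: 0.04, 2: 0.12, 3: 0.04, 4: 0.01, 5: 0.03,
    6: 0.13, 7: 0.09, 8: 0.07, 9: 0.06, 10: 0.09, 11: 0.26,
}


def top_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weight, largest first."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


total = sum(head_weights.values())
top = top_heads(head_weights)

print("sum of listed weights:", round(total, 2))
for head, weight in top:
    print(f"head {head}: {weight:.2f}")
```

The listed values are rounded to two decimals, so their sum need not be exactly 1.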
Negative Logits
Alone       -1.20
consent     -1.16
mins        -1.10
xton        -1.07
anymore     -1.06
twitch      -1.05
Desktop     -1.04
immunity    -1.00
debian      -1.00
alone       -0.99
Positive Logits
than          1.90
than          1.82
Than          1.42
eem           1.32
!--           1.28
Reviewer      1.28
VERTISEMENT   1.27
ゼ            1.27
!).           1.25
ワ            1.25
Activations Density 0.036%