INDEX
Explanations
phrases related to conflicts or struggles
Head Attr Weights
0:0.07
1:0.01
2:0.04
3:0.24
4:0.02
5:0.06
6:0.02
7:0.04
8:0.02
9:0.02
10:0.38
11:0.01
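
The head attribution weights above can be read programmatically to see which heads dominate. The following is a minimal sketch, assuming each line has the form "head:weight"; the names RAW, parse_head_weights, and top_heads are hypothetical helpers for illustration and are not part of the dashboard.

```python
# Head attribution weights copied from the listing above, one "head:weight" per line.
RAW = """0:0.07
1:0.01
2:0.04
3:0.24
4:0.02
5:0.06
6:0.02
7:0.04
8:0.02
9:0.02
10:0.38
11:0.01"""


def parse_head_weights(raw):
    """Parse 'head:weight' lines into a {head_index: weight} dict."""
    weights = {}
    for line in raw.splitlines():
        head, _, value = line.partition(":")
        weights[int(head)] = float(value)
    return weights


def top_heads(weights, k=2):
    """Return the k (head, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


weights = parse_head_weights(RAW)
for head, w in top_heads(weights):
    print(f"head {head}: {w:.2f}")
print(f"sum of listed weights: {sum(weights.values()):.2f}")
```

On the values listed here, heads 10 and 3 carry the largest weights (0.38 and 0.24), and the twelve listed values sum to 0.93.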
Negative Logits
respectively: -2.19
SPONSORED: -2.15
graduates: -2.08
contrasted: -2.08
counterparts: -1.94
contrasts: -1.91
backdrop: -1.90
reclaimed: -1.87
版: -1.86
�: -1.86
Positive Logits
groove: 2.87
ASAP: 2.25
contrace: 2.03
funk: 2.02
enthus: 1.94
smoking: 1.88
Phones: 1.86
trouble: 1.85
Smoking: 1.84
backdoor: 1.82
Activations Density 0.043%