INDEX
Explanations
phrases relating to criticism or analysis of societal issues
Head Attr Weights
0:0.03
1:0.02
2:0.25
3:0.14
4:0.15
5:0.06
6:0.03
7:0.04
8:0.07
9:0.05
10:0.06
11:0.05
Negative Logits
flanked       -1.67
paused        -1.48
glanced       -1.45
ertodd        -1.39
urgently      -1.37
gently        -1.36
respectful    -1.35
nodded        -1.34
promptly      -1.34
coordinating  -1.33
Positive Logits
nor           2.17
predecessors  2.03
counterparts  1.76
predecessor   1.76
>.            1.73
attRot        1.71
SPONSORED     1.68
anymore       1.62
ndra          1.55
Nor           1.55
Activations Density 0.489%