INDEX
Explanations
phrases that indicate moderation or management
Head Attr Weights
0:0.02
1:0.02
2:0.04
3:0.06
4:0.13
5:0.02
6:0.03
7:0.38
8:0.03
9:0.03
10:0.13
11:0.06
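As a reading aid, the weights above can be parsed to see which attention head dominates this feature. The sketch below is only an illustration: it assumes each line is a `head:weight` pair, and the names `RAW`, `parse_head_weights`, `weights`, `top_head`, and `total` are hypothetical, not part of the dashboard.

```python
# Head attribution weights copied from the listing above.
# Assumption: each line has the form "head_index:weight".
RAW = """0:0.02
1:0.02
2:0.04
3:0.06
4:0.13
5:0.02
6:0.03
7:0.38
8:0.03
9:0.03
10:0.13
11:0.06"""


def parse_head_weights(raw: str) -> dict[int, float]:
    """Turn "head:weight" lines into a {head_index: weight} mapping."""
    weights = {}
    for line in raw.splitlines():
        head, value = line.split(":")
        weights[int(head)] = float(value)
    return weights


weights = parse_head_weights(RAW)

# Head with the largest attribution weight.
top_head = max(weights, key=weights.get)

# Sum of the displayed weights; it need not be exactly 1 because the
# listed values are rounded to two decimals.
total = sum(weights.values())

print(top_head, weights[top_head])  # 7 0.38
print(round(total, 2))
```

With these values, head 7 carries the largest share (0.38), followed by heads 4 and 10 at 0.13 each.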
Negative Logits
覚醒: -1.68
fulfilled: -1.68
paralle: -1.55
Catalog: -1.53
envision: -1.51
fulfillment: -1.49
retched: -1.45
rists: -1.44
士: -1.43
INFO: -1.42
Positive Logits
moder: 2.02
moderation: 1.74
Dialogue: 1.65
Atmosp: 1.58
commenting: 1.51
dialogue: 1.49
democratically: 1.46
debates: 1.45
dissenting: 1.45
debate: 1.44
Activations Density 0.001%