INDEX
Explanations
phrases that express strong opinions or recommendations
Head Attr Weights
0:0.02
1:0.02
2:0.07
3:0.06
4:0.15
5:0.04
6:0.04
7:0.31
8:0.03
9:0.04
10:0.08
11:0.08
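
As a small illustration of how the head attribution weights above might be read, the sketch below ranks the heads by weight. It assumes each `index:weight` entry is a per-head attribution score. The names `head_weights` and `rank_heads` are hypothetical and not part of the dashboard. The listed values sum to 0.94, so they are treated here as rounded scores rather than an exact distribution.

```python
# Head attribution weights copied from the listing above (head index -> weight).
# Assumption: each entry is a per-head attribution score for this feature.
head_weights = {
    0: 0.02, 1: 0.02, 2: 0.07, 3: 0.06,
    4: 0.15, 5: 0.04, 6: 0.04, 7: 0.31,
    8: 0.03, 9: 0.04, 10: 0.08, 11: 0.08,
}


def rank_heads(weights):
    """Return (head, weight) pairs sorted from largest to smallest weight."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_heads(head_weights)
top_head, top_weight = ranked[0]
total = sum(head_weights.values())

print(f"top head: {top_head} (weight {top_weight:.2f})")
print(f"sum of listed weights: {total:.2f}")
```

On these numbers, head 7 carries the largest weight, followed by head 4.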
Negative Logits
MRI:-1.71
セ:-1.66
alter:-1.65
Dispatch:-1.60
igsaw:-1.50
rette:-1.50
916:-1.49
ushima:-1.45
eeper:-1.42
stay:-1.41
Positive Logits
integrity:1.90
Flavoring:1.85
reputation:1.79
sugg:1.72
virtues:1.71
courage:1.63
colleg:1.62
behavi:1.60
demeanor:1.59
Surviv:1.59
Activations Density 0.001%