INDEX
Explanations
phrases that indicate implications or consequences
Head Attr Weights
0:0.02
1:0.01
2:0.08
3:0.05
4:0.12
5:0.02
6:0.02
7:0.45
8:0.02
9:0.03
10:0.06
11:0.06
Negative Logits
isans       -1.72
osponsors   -1.69
ubs         -1.58
chest       -1.52
Horses      -1.47
Hun         -1.46
Babe        -1.43
romy        -1.41
fur         -1.40
tsy         -1.38
Positive Logits
implications     1.91
ramifications    1.86
directions       1.84
wellbeing        1.60
behavi           1.55
Relations        1.55
negatively       1.51
meanings         1.50
sustainability   1.49
Impl             1.48
Activations Density 0.006%