Explanations
interrogative phrases and questions related to feelings, actions, and moral dilemmas
Head Attr Weights
0:0.08
1:0.04
2:0.03
3:0.08
4:0.05
5:0.09
6:0.05
7:0.02
8:0.19
9:0.29
10:0.01
11:0.02
Negative Logits
�士:-1.90
avour:-1.73
laun:-1.72
Beir:-1.61
lacked:-1.57
istration:-1.56
court:-1.54
Hansen:-1.53
Provided:-1.52
ラン:-1.49
Positive Logits
compare:2.00
ilater:1.74
miracle:1.71
fry:1.69
reconcile:1.65
Explain:1.65
regress:1.64
ancial:1.63
phosph:1.62
metaphors:1.61
Activations Density 0.032%