Explanations
phrases related to explanations or clarity of concepts and actions
Head Attribution Weights (head index:weight)
0:0.02
1:0.02
2:0.05
3:0.08
4:0.10
5:0.02
6:0.04
7:0.45
8:0.02
9:0.02
10:0.07
11:0.06
Negative Logits
elight      -2.08
ヘ           -1.73
emouth      -1.62
erity       -1.62
mouth       -1.54
cedented    -1.50
ibaba       -1.50
pilgr       -1.44
ngth        -1.43
rive        -1.41
Positive Logits
why              2.09
WHY              2.08
why              2.07
aloud            1.93
gist             1.90
actionDate       1.84
intric           1.81
convoluted       1.81
complicated      1.79
misunderstand    1.79
Activations Density 0.054%