INDEX
Explanations
words related to welcoming or friendly interactions
Head Attr Weights
0:0.07
1:0.02
2:0.21
3:0.07
4:0.17
5:0.03
6:0.02
7:0.02
8:0.13
9:0.15
10:0.04
11:0.01
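The head attribution weights above can be ranked to see which attention heads contribute most to this feature. The following is a minimal sketch, assuming the `head:weight` layout shown in the listing; the names `HEAD_WEIGHTS_TEXT`, `parse_head_weights`, and `top_heads` are hypothetical helpers introduced for illustration and are not part of the dashboard.

```python
# Head attribution weights copied from the listing above ("head:weight" per line).
HEAD_WEIGHTS_TEXT = """
0:0.07
1:0.02
2:0.21
3:0.07
4:0.17
5:0.03
6:0.02
7:0.02
8:0.13
9:0.15
10:0.04
11:0.01
"""


def parse_head_weights(text):
    """Parse 'head:weight' lines into a dict mapping head index to weight."""
    weights = {}
    for line in text.strip().splitlines():
        head, weight = line.split(":")
        weights[int(head)] = float(weight)
    return weights


def top_heads(weights, k=3):
    """Return the k head indices with the largest attribution weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


weights = parse_head_weights(HEAD_WEIGHTS_TEXT)
print(top_heads(weights))  # [2, 4, 9]
```

The displayed weights are rounded to two decimals, so their sum need not equal exactly 1.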
Negative Logits
��: -1.59
��: -1.45
�: -1.42
�: -1.35
�: -1.33
��: -1.31
ById: -1.31
irez: -1.31
�: -1.31
�: -1.30
Positive Logits
izons: 1.68
oats: 1.32
itiveness: 1.30
estern: 1.29
heels: 1.28
pine: 1.26
entin: 1.25
welcoming: 1.25
welcome: 1.24
ard: 1.24
Activations Density 0.015%