INDEX
Explanations
phrases related to confrontation and personal accountability in discussions
Head Attr Weights
0:0.01
1:0.03
2:0.08
3:0.06
4:0.02
5:0.03
6:0.06
7:0.09
8:0.25
9:0.08
10:0.11
11:0.12
Negative Logits
iggins    -1.18
elo       -1.16
�         -1.09
prise     -1.09
�         -1.08
ipel      -1.04
adelphia  -1.01
eger      -1.00
ás        -1.00
alg       -0.99
Positive Logits
hadn           1.14
');            1.12
')             1.09
clicked        1.02
discriminated  1.01
'),            1.00
existed        1.00
didnt          0.99
lied           0.96
chwitz         0.95
Activations Density 0.003%