INDEX
Explanations
phrases related to courage or the act of taking risks
Head Attr Weights
0:0.02
1:0.02
2:0.16
3:0.05
4:0.11
5:0.02
6:0.04
7:0.32
8:0.03
9:0.03
10:0.07
11:0.08
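
The head attribution weights above can be ranked to see which attention heads dominate. The following is a minimal sketch, assuming each `head:weight` entry gives that head's share of the attribution to this feature (the listing does not define the quantity). The helper names `parse_head_weights` and `top_heads` are hypothetical and exist only for this illustration.

```python
# Head attribution weights copied verbatim from the listing ("head:weight").
RAW_WEIGHTS = (
    "0:0.02 1:0.02 2:0.16 3:0.05 4:0.11 5:0.02 "
    "6:0.04 7:0.32 8:0.03 9:0.03 10:0.07 11:0.08"
)


def parse_head_weights(text):
    """Turn 'head:weight' items into a {head_index: weight} dict."""
    weights = {}
    for item in text.split():
        head, value = item.split(":")
        weights[int(head)] = float(value)
    return weights


def top_heads(weights, k=3):
    """Return the k (head, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


weights = parse_head_weights(RAW_WEIGHTS)
total = sum(weights.values())   # sum of the listed, rounded values
ranked = top_heads(weights)     # heads ordered by descending weight

for head, weight in ranked:
    print(f"head {head}: {weight:.2f}")
print(f"sum of listed weights: {total:.2f}")
```

Under that reading, heads 7, 2, and 4 carry the largest weights. The listed values sum to 0.95 rather than 1.00, which may simply reflect rounding to two decimals.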
Negative Logits
Token         Logit
Causes        -1.72
interesting   -1.55
eteenth       -1.52
�             -1.50
priority      -1.45
ム            -1.42
leness        -1.39
ciation       -1.37
Lin           -1.37
aber          -1.36
Positive Logits
Token         Logit
tee           1.56
bunker        1.51
kosher        1.47
blackmail     1.46
withdraw      1.42
backdoor      1.41
blindly       1.41
handc         1.40
stra          1.40
toe           1.39
Activations Density 0.034%