INDEX
Explanations
words related to harboring, protection, and ambivalence towards responsibility or wrongdoing
Head Attr Weights
0:0.02
1:0.01
2:0.07
3:0.07
4:0.14
5:0.03
6:0.05
7:0.36
8:0.04
9:0.04
10:0.05
11:0.06
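
As a reading aid, the per-head weights above can be parsed and summarized with a few lines of Python. This is a minimal sketch, not part of the dashboard itself: the names `RAW` and `parse_head_weights` are hypothetical, and it assumes only that each line follows the `head:weight` layout shown above.

```python
# Head attribution weights copied verbatim from the listing above,
# one "head:weight" pair per line.
RAW = """0:0.02
1:0.01
2:0.07
3:0.07
4:0.14
5:0.03
6:0.05
7:0.36
8:0.04
9:0.04
10:0.05
11:0.06"""


def parse_head_weights(text):
    """Parse 'head:weight' lines into a {head_index: weight} dict (hypothetical helper)."""
    weights = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        head, value = line.split(":", 1)
        weights[int(head)] = float(value)
    return weights


weights = parse_head_weights(RAW)
top_head = max(weights, key=weights.get)  # head with the largest weight
total = sum(weights.values())             # listed weights are rounded to two decimals

print(top_head, weights[top_head], round(total, 2))
```

With the values listed, head 7 carries the largest weight (0.36), and the twelve rounded weights sum to 0.94 rather than exactly 1.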
Negative Logits
=>              -1.38
disappear       -1.36
imil            -1.33
Enlarge         -1.32
budget          -1.32
gey             -1.31
availability    -1.30
chopping        -1.30
merce           -1.30
ifully          -1.30
Positive Logits
emotions        1.62
optimism        1.61
doubts          1.53
feelings        1.52
pent            1.52
sidx            1.51
�               1.46
emotion         1.45
spoilers        1.45
essim           1.44
Activations Density: 0.001%