INDEX
Explanations
terms related to personal suffering or discomfort
Head Attr Weights
0:0.09
1:0.07
2:0.08
3:0.09
4:0.07
5:0.07
6:0.07
7:0.07
8:0.09
9:0.09
10:0.08
11:0.06
Negative Logits
ewitness  -2.38
renheit  -2.25
omach  -2.23
arted  -2.17
irteen  -2.16
aniel  -2.14
untarily  -2.12
ーティ  -2.11
resy  -2.11
emort  -2.09
Positive Logits
crop  2.07
clus  2.04
incentive  1.98
BLM  1.97
bip  1.97
contribut  1.95
combo  1.95
bloc  1.95
AB  1.94
divest  1.94
Activations Density 0.000%