INDEX
Explanations
references to societal issues and norms
Head Attribution Weights
Head   Weight
0      0.02
1      0.02
2      0.11
3      0.13
4      0.06
5      0.03
6      0.05
7      0.03
8      0.27
9      0.11
10     0.08
11     0.04
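The twelve weights above sum to about 0.95, which is consistent with per-head shares of a single total after rounding to two decimals. The sketch below assumes one plausible definition that the dashboard does not state: the feature's decoder row spans the concatenated outputs of all heads, and each head's weight is the L2 norm of its slice divided by the sum of the slice norms. The function name `head_attribution` and the toy numbers are hypothetical.

```python
import math


def head_attribution(decoder_row, n_heads):
    """Return each head's share of the decoder row's per-slice L2 norms.

    Assumption: decoder_row is the concatenation of n_heads equal-length
    per-head slices, and a head's weight is its slice norm divided by the
    sum of all slice norms, so the returned shares sum to 1.
    """
    if len(decoder_row) % n_heads != 0:
        raise ValueError("decoder_row length must be divisible by n_heads")
    d_head = len(decoder_row) // n_heads
    norms = [
        math.sqrt(sum(x * x for x in decoder_row[h * d_head:(h + 1) * d_head]))
        for h in range(n_heads)
    ]
    total = sum(norms)
    if total == 0:
        # An all-zero row has no head to attribute to.
        return [0.0] * n_heads
    return [n / total for n in norms]


# Toy example: 4 heads with 2 dimensions each.
# Slice norms are 5, 0, 10, 0, giving shares of 1/3, 0, 2/3, 0.
print(head_attribution([3.0, 4.0, 0.0, 0.0, 6.0, 8.0, 0.0, 0.0], 4))
```

Under this definition, head 8 (0.27) would hold the largest share of the feature's decoder norm.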
Negative Logits
Schne         -1.65
Jagu          -1.57
Unch          -1.57
Haas          -1.52
FTWARE        -1.51
ADRA          -1.49
Hond          -1.48
succession    -1.44
Jere          -1.44
jarring       -1.43
Positive Logits
0010          2.07
cms           2.06
english       2.03
Reviewer      1.92
vor           1.77
サーティワン   1.77
odi           1.75
nr            1.72
rw            1.69
widget        1.68
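Logit lists like the two above are commonly produced by projecting a feature's output direction onto the unembedding matrix and reading off the lowest and highest scoring tokens. The sketch below assumes that approach. It also assumes `decoder_vec` is already expressed in the residual stream, so an attention-output feature would first be mapped through the output projection, and it ignores the final layer norm. The names `logit_effects`, `W_U` (laid out as d_model rows by vocab columns) and the toy vocabulary are hypothetical.

```python
def logit_effects(decoder_vec, W_U, vocab, k):
    """Return the k most negative and k most positive (token, score) pairs.

    Assumption: each token's score is the dot product of decoder_vec with
    that token's column of W_U, where W_U is a list of d_model rows, each
    holding one value per vocabulary entry.
    """
    d_model = len(decoder_vec)
    scores = [
        sum(decoder_vec[i] * W_U[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]        # lowest scores first
    positive = ranked[::-1][:k]  # highest scores first
    return negative, positive


# Toy example: d_model = 2, vocabulary of three tokens.
# The scores are a: 1.0, b: 2.0, c: 0.0.
neg, pos = logit_effects([1.0, 1.0], [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]], ["a", "b", "c"], 1)
print(neg, pos)
```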
Activation Density: 0.103%
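Activation density is usually reported as the percentage of tokens in a sample on which the feature fires. A density of 0.103% therefore corresponds to roughly 1,030 active tokens per million. The sketch below assumes that "fires" means an activation strictly above a threshold of zero. The function name `activation_density` and the `threshold` parameter are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds threshold.

    Assumption: a feature counts as active on a token when its activation
    is strictly greater than threshold (zero by default).
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy example: 2 of 10 tokens are active, so the density is 20%.
print(activation_density([0.0, 0.0, 0.5, 0.0, 1.2, 0.0, 0.0, 0.0, 0.0, 0.0]))
```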