INDEX
Explanations
the presence of conjunctions or phrases indicating contrast
Head Attr Weights
0:0.08
1:0.08
2:0.09
3:0.08
4:0.09
5:0.07
6:0.08
7:0.08
8:0.09
9:0.06
10:0.07
11:0.07
Negative Logits
Token           Logit
Cosmetic        -1.99
renamed         -1.89
warr            -1.87
banners         -1.83
behav           -1.83
$.              -1.82
Uniform         -1.81
epit            -1.80
Proud           -1.80
;;;;;;;;;;;;    -1.80
Positive Logits
Token           Logit
Todd            2.19
ubes            2.19
olars           2.01
nos             1.99
igma            1.98
hare            1.92
mares           1.91
sites           1.84
tails           1.84
inity           1.83
Activations Density 0.000%