Explanations
phrases that express varying degrees of comparison or judgment
Head Attr Weights
0:0.02
1:0.02
2:0.16
3:0.09
4:0.28
5:0.02
6:0.06
7:0.10
8:0.04
9:0.04
10:0.06
11:0.06
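The page does not define these weights. Assuming they are each head's share of the feature's attribution, a few lines of Python show how concentrated it is. Head 4 carries the largest share, followed by heads 2 and 7. The listed values sum to about 0.95. If they are normalized shares, that gap from 1.0 is within what rounding 12 values to two decimals could produce (up to 0.06).

```python
# Head attribution weights as listed above (head index -> weight).
head_attr = {0: 0.02, 1: 0.02, 2: 0.16, 3: 0.09, 4: 0.28, 5: 0.02,
             6: 0.06, 7: 0.10, 8: 0.04, 9: 0.04, 10: 0.06, 11: 0.06}

# Rank heads by weight, largest first.
ranked = sorted(head_attr.items(), key=lambda kv: kv[1], reverse=True)
top_head, top_weight = ranked[0]
total = sum(head_attr.values())

print(f"top head: {top_head} ({top_weight:.2f})")   # top head: 4 (0.28)
print(f"top 3 heads: {[h for h, _ in ranked[:3]]}")  # top 3 heads: [4, 2, 7]
print(f"sum of listed weights: {total:.2f}")         # sum of listed weights: 0.95
```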
Negative Logits
Bei         -1.61
taboola     -1.52
Cosponsors  -1.46
20439       -1.43
王          -1.38
fam         -1.36
Participant -1.36
OURCE       -1.35
Story       -1.34
ettings     -1.34
Positive Logits
squared   1.66
sizing    1.60
messing   1.59
ner       1.55
wrinkles  1.50
sloppy    1.44
bells     1.42
hormones  1.39
joking    1.38
tweaking  1.37
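The two logit lists contain the most extreme tokens from a per-token score. The page does not say how that score is computed. A common choice is direct logit attribution, which projects the feature's decoder direction through the unembedding. The sketch below assumes a score already exists for each vocabulary token and only shows the selection step. The helper name `top_and_bottom_tokens` is hypothetical, and the toy input uses a few of the scores listed above.

```python
import heapq

def top_and_bottom_tokens(scores, k=10):
    """Return the k highest- and k lowest-scoring (token, score) pairs.

    `scores` maps token string -> score. Hypothetical helper; the
    dashboard's own pipeline is not shown on the page.
    """
    positive = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    negative = heapq.nsmallest(k, scores.items(), key=lambda kv: kv[1])
    return positive, negative

# Toy input: a subset of the scores listed above.
toy = {"squared": 1.66, "sizing": 1.60, "messing": 1.59,
       "Bei": -1.61, "taboola": -1.52, "Cosponsors": -1.46}
pos, neg = top_and_bottom_tokens(toy, k=2)
print(pos)  # [('squared', 1.66), ('sizing', 1.6)]
print(neg)  # [('Bei', -1.61), ('taboola', -1.52)]
```

For a full vocabulary, `heapq.nlargest`/`nsmallest` avoid sorting every score when only the top and bottom k are needed.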
Activation Density: 0.113%
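Suppose density means the share of tokens on which the feature fires. This is an assumption, and so is the firing threshold of 0.0 used below. Then 0.113% works out to roughly 1 token in 885. The helper `activation_density` and the activation values are made up for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumes density = share of tokens where the feature fires; the
    0.0 threshold is an assumption, not taken from the dashboard.
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)

# Made-up data: a feature firing on 113 of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(113):
    acts[i * 800] = 1.5  # arbitrary positive activation at spaced positions
density = activation_density(acts)
print(f"{density:.3%}")  # 0.113%
```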