Explanations
phrases that indicate omitted information or facts
Head Attr Weights
0:0.01
1:0.01
2:0.08
3:0.05
4:0.14
5:0.02
6:0.05
7:0.41
8:0.04
9:0.03
10:0.06
11:0.06
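A minimal sketch of how the per-head weights listed above could be ranked to find the dominant head. The dictionary copies the values as displayed on the page; the variable names are illustrative, and how the weights were computed is not stated here.

```python
# Head attribution weights copied from the list above (head index -> weight).
head_weights = {
    0: 0.01, 1: 0.01, 2: 0.08, 3: 0.05, 4: 0.14, 5: 0.02,
    6: 0.05, 7: 0.41, 8: 0.04, 9: 0.03, 10: 0.06, 11: 0.06,
}

# Rank heads from largest to smallest weight.
ranked = sorted(head_weights.items(), key=lambda kv: kv[1], reverse=True)

# The head with the largest weight, and its share of the displayed total.
top_head, top_weight = ranked[0]
total = sum(head_weights.values())
top_share = top_weight / total

for head, weight in ranked[:3]:
    print(f"head {head}: {weight:.2f}")
```

The displayed values are rounded to two decimals, so their sum need not be exactly 1; the share is computed relative to the displayed total for that reason.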
Negative Logits
anic      -1.66
orses     -1.65
rha       -1.60
lav       -1.60
wagen     -1.55
ivot      -1.53
yrinth    -1.53
oros      -1.51
rg        -1.51
oled      -1.50
Positive Logits
altogether      2.01
anymore         1.80
distinctions    1.64
jokes           1.62
incidentally    1.54
redund          1.47
because         1.44
comparisons     1.44
Doodle          1.42
comparison      1.41
Activations Density 0.004%