Explanations
phrases indicating future intentions or possibilities
Negative Logits
ward    -0.18
zug     -0.17
747     -0.17
223     -0.16
y       -0.16
ro      -0.15
rait    -0.15
.tc     -0.15
Herr    -0.15
145     -0.14
Positive Logits
ONES      0.16
경        0.16
WP        0.15
onne      0.15
expected  0.15
ying      0.14
tes       0.14
expected  0.14
etimes    0.14
illard    0.14
Activations Density 0.050%