INDEX
Explanations
phrases indicating novelty or uniqueness
Negative Logits
á»ijt    -0.20
Actual    -0.18
ê´    -0.15
ameleon    -0.14
ãĤį    -0.14
itte    -0.14
[~,    -0.14
Actual    -0.14
zeug    -0.14
ลà¸ĩ    -0.14
Positive Logits
previously    0.34
elsewhere    0.31
otherwise    0.31
previous    0.26
seen    0.25
Previously    0.25
seen    0.24
else    0.24
otherwise    0.24
before    0.23
Activations Density 0.094%