Explanations
phrases expressing newness or innovation
Negative Logits
col          -0.18
lopen        -0.14
edef         -0.14
darwin       -0.14
ader         -0.14
inke         -0.14
_activate    -0.14
Suff         -0.13
Cary         -0.13
605          -0.13
Positive Logits
feature      0.19
novel        0.18
unique       0.17
nov          0.17
novelty      0.16
feature      0.16
twist        0.15
Unique       0.15
hower        0.15
Feature      0.14
Activations Density 0.094%
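
The density figure above is commonly read as the fraction of sampled tokens on which the feature fires. The sketch below illustrates that reading only. The function name `activation_density`, the zero threshold, and the sample data are all assumptions for illustration, not the dashboard's actual computation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as "active" when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10 active tokens out of 10,000.
sample = [0.0] * 9990 + [1.2] * 10
density = activation_density(sample)
print(f"Activations Density {density:.3%}")
```

Under this reading, a density of 0.094% would correspond to roughly 94 active tokens per 100,000 sampled.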