INDEX
Explanations
phrases that indicate significance or importance
Negative Logits
tidy      -0.15
next      -0.15
381       -0.14
NEXT      -0.14
olle      -0.14
latest    -0.14
Next      -0.14
_FP       -0.13
get       -0.13
iani      -0.13
Positive Logits
fact      0.23
way       0.21
kind      0.20
apart     0.17
kind      0.17
regard    0.17
reason    0.16
matter    0.16
factor    0.16
fact      0.16
Activations Density: 0.147%