Explanations
references to alternative explanations or viewpoints
Negative Logits
Token        Logit
ysters       -0.16
-floating    -0.16
okes         -0.15
IDE          -0.15
aday         -0.14
orns         -0.14
allon        -0.14
oupon        -0.14
agna         -0.13
Inlining     -0.13
Positive Logits
Token        Logit
words        0.28
words        0.22
respects     0.17
said         0.16
Words        0.16
.words       0.16
news         0.16
says         0.15
saying       0.15
cases        0.15
Activations Density: 0.018%