INDEX
Explanations
references to academic papers or research articles
mentions of research papers or academic publications
Negative Logits
alez       -0.87
akening    -0.76
aren       -0.64
endor      -0.62
cffffcc    -0.62
iak        -0.60
ostic      -0.59
rt         -0.59
eal        -0.59
rogens     -0.58
Positive Logits
Paper      1.10
clip       1.01
towels     0.88
paper      0.87
Paper      0.87
papers     0.84
flies      0.78
papers     0.76
towel      0.76
books      0.76
Activations Density: 0.013%