Explanations
references to research papers
references to research papers and academic publications
Negative Logits
Token       Logit
alez        -0.88
akening     -0.70
Lowell      -0.64
endor       -0.64
eties       -0.60
xon         -0.59
Brittany    -0.58
Harmony     -0.58
ichita      -0.57
Sax         -0.57
Positive Logits
Token       Logit
Paper       1.21
clip        1.02
paper       0.97
flies       0.91
papers      0.89
Paper       0.89
towels      0.86
papers      0.84
paper       0.81
towel       0.79
Activations Density: 0.014%
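
As a rough illustration of what a density figure like this might mean, the sketch below assumes that activation density is the fraction of token positions on which the feature's activation is above zero. That definition, the function name activation_density, and the toy data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (made up): the feature fires on 14 of 100,000 token positions.
toy_activations = [0.0] * 99_986 + [1.5] * 14
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.014%
```

Under this assumed definition, a density of 0.014% would correspond to the feature firing on roughly 14 out of every 100,000 tokens, which is a very sparse feature.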