INDEX
Explanations
references to the word "Elephant"
references to elephants
Negative Logits
sburgh            -0.85
Kenobi            -0.74
raints            -0.73
DERR              -0.71
Responsibility    -0.70
lain              -0.70
aldehyde          -0.69
Papers            -0.69
Sakuya            -0.66
Hilton            -0.66
Positive Logits
venth             1.46
phant             1.28
fter              0.92
oton              0.90
ven               0.88
ph                0.85
lect              0.84
ITH               0.83
LECT              0.82
azar              0.81
Activations Density 0.025%