INDEX
Explanations
questions starting with "How does" or "Does."
Negative Logits
hig        -0.73
bis        -0.68
xon        -0.68
iken       -0.68
arer       -0.65
bsp        -0.64
ullivan    -0.64
lla        -0.62
iem        -0.61
psc        -0.61
Positive Logits
?!         1.15
?          1.13
?]         1.12
?),        1.11
?)         1.10
?!"        1.08
?"         1.04
?:         1.02
?).        1.02
?",        1.01
Activations Density 0.421%