Explanations
questions or prompts starting with the word "What"
questions beginning with the word "What."
Negative Logits
apsed    -0.74
swick    -0.66
emp      -0.64
hew      -0.64
pic      -0.63
iva      -0.61
Lago     -0.61
ammed    -0.61
rolley   -0.59
abel     -0.59
Positive Logits
do              0.94
does            0.94
kinds           0.91
distinguishes   0.89
determines      0.88
happens         0.87
motiv           0.84
qualifies       0.82
are             0.81
did             0.81
Activations Density 0.054%