INDEX
Explanations
questions or prompts for answers, often involving a specific task or information
words related to responding to questions or inquiries
Negative Logits
chin        -0.77
heric       -0.75
zinski      -0.73
Vengeance   -0.69
akin        -0.68
Nanto       -0.68
robat       -0.67
gotten      -0.66
ufact       -0.66
nered       -0.64
Positive Logits
ysis        1.00
answer      0.89
questions   0.88
answ        0.87
Questions   0.84
swers       0.83
answering   0.82
yes         0.80
Answer      0.77
question    0.77
Activations Density: 0.020%