INDEX
Explanations
explicit answers or responses
instances of questions being answered
Negative Logits
zinski   -0.73
gotten   -0.72
ammy     -0.70
wana     -0.70
bats     -0.70
ovie     -0.70
robat    -0.69
erker    -0.69
chin     -0.67
heric    -0.67
Positive Logits
ysis        1.19
answer      1.15
answer      1.14
answ        1.05
swers       0.99
Answer      0.97
Answer      0.96
answering   0.94
answered    0.91
Answers     0.84
Activations Density 0.016%