INDEX
Explanations
questions and inquiries throughout the document
Negative Logits
Token         Logit
confir        -0.75
wrists        -0.74
retri         -0.73
discharged    -0.67
oute          -0.67
offending     -0.67
stret         -0.67
advoc         -0.66
enf           -0.66
straps        -0.65
Positive Logits
Token         Logit
Experts       1.05
Why           1.00
Vote          0.97
Answer        0.95
Lessons       0.95
Isn           0.95
Nope          0.95
Find          0.94
Recent        0.93
Debate        0.93
Activations Density: 0.044%