Explanations
asking for clarification or offers of help
Negative Logits
no           0.66
ignoring     0.64
must         0.64
enemy        0.58
murderous    0.58
tyranny      0.56
obeyed       0.56
oppressive   0.56
delusion     0.55
doomed       0.54
Positive Logits
Fragen       1.08
Questions    1.07
preguntas    1.03
informacje   1.00
questions    1.00
inquiries    0.99
질문         0.99
Questions    0.99
summaries    0.97
sorular      0.96
Activations Density 5.554%