INDEX
Explanations
phrases related to non-responsiveness or denials
Negative Logits
righteousness   -0.75
injust          -0.70
Rebellion       -0.68
mediocre        -0.68
Empires         -0.66
proportions     -0.65
Levels          -0.65
Conversation    -0.65
Ribbon          -0.64
horizont        -0.63
Positive Logits
disclose        1.11
disclosed       1.08
specify         1.06
formally        1.03
divul           0.94
mention         0.91
elabor          0.91
disclosing      0.91
necessarily     0.90
explain         0.90
Activations Density 0.088%