INDEX
Explanations
sentences that indicate statements or declarations
Negative Logits
utter           -0.82
installments    -0.69
royalty         -0.67
transact        -0.66
monopol         -0.65
delusion        -0.65
expense         -0.65
closet          -0.64
silly           -0.64
victories       -0.64
Positive Logits
Additionally    0.94
However         0.90
Similarly       0.88
<|endoftext|>   0.88
Furthermore     0.88
Moreover        0.87
Along           0.85
Adding          0.85
Instead         0.85
Afterwards      0.85
Activations Density: 0.289%