Explanations
phrases related to arguing or debating
Negative Logits
Cosponsors        -0.87
til               -0.77
marked            -0.73
ned               -0.72
DragonMagazine    -0.71
falls             -0.70
typ               -0.70
iste              -0.67
ledged            -0.67
maker             -0.63
Positive Logits
preserving        1.19
avoiding          1.03
keeping           0.99
protecting        0.98
accuracy          0.96
fairness          0.96
sanity            0.94
maintaining       0.93
realism           0.92
secrecy           0.92
Activations Density: 0.053%