INDEX
Explanations
phrases indicating support or affirmation
Negative Logits
Rouge     -0.16
anch      -0.15
DM        -0.15
eka       -0.14
UG        -0.14
conj      -0.14
ATS       -0.14
favor     -0.14
ad        -0.14
eness     -0.14
Positive Logits
backing   0.32
Backing   0.27
/back     0.24
backed    0.23
backs     0.21
-backed   0.21
(back     0.20
=back     0.20
haul      0.19
aret      0.18
Activations Density: 0.015%