Explanations
expressions that convey contrast or opposition in statements
Negative Logits
Token       Logit
erland      -0.18
EMPLARY     -0.17
loor        -0.16
lew         -0.16
824         -0.15
atee        -0.15
ERO         -0.15
neod        -0.14
_Tis        -0.14
ospace      -0.14
Positive Logits
Token       Logit
vice        1.06
Vice        0.85
vice        0.82
reverse     0.65
reverse     0.56
Conversely  0.55
converse    0.54
Reverse     0.52
Reverse     0.52
VICE        0.51
Activations Density: 0.126%