INDEX
Explanations
critical or negative statements from a variety of domains or contexts
expressions of opinion or criticism in dialogue
Negative Logits
Pg          -0.61
+.          -0.58
eligible    -0.57
Reviewed    -0.57
rupal       -0.54
iden        -0.52
adra        -0.52
antic       -0.50
cum         -0.49
ordes       -0.48
Positive Logits
%"      1.21
)",     1.02
…"      0.97
"—      0.95
"]      0.94
.")     0.94
..."    0.91
")      0.90
)"      0.90
,"      0.89
Activations Density 1.716%