INDEX
Explanations
affirmative phrases and statements related to self-awareness and acknowledgment
Negative Logits
"];       -0.70
!")       -0.68
متعلقه    -0.68
"):       -0.67
'];       -0.64
//        -0.64
"]).      -0.63
")));     -0.63
&___      -0.63
()]       -0.63
Positive Logits
disagree        0.59
apples          0.51
Distribuzione   0.51
disprove        0.49
oike            0.48
facts           0.47
argument        0.47
rebuttal        0.47
impianto        0.47
事實            0.47
Activations Density 0.448%