INDEX
Explanations
explicit disclaimers in a text
assertions of opinion or truthfulness related to content
Negative Logits
rants         -0.70
colm          -0.70
vati          -0.70
eness         -0.67
Roberts       -0.66
worthiness    -0.66
luaj          -0.65
LOS           -0.65
racuse        -0.63
hung          -0.63
Positive Logits
approximate   1.07
unofficial    1.03
NOT           1.01
purely        0.99
subjective    0.98
tentative     0.97
fictitious    0.95
provisional   0.94
strictly      0.91
preliminary   0.91
Activations Density: 0.191%