Explanations
words related to explicit statements or instructions
phrases that mention explicitness or clarity in statements
Negative Logits
Tycoon    -0.87
nesota    -0.78
«ĺ        -0.77
Squ       -0.77
STON      -0.72
rug       -0.71
Royale    -0.70
busters   -0.70
ADS       -0.68
Score     -0.67
Positive Logits
deline         0.81
guiActiveUn    0.79
explicit       0.79
ities          0.78
disclaim       0.77
textual        0.77
explicitly     0.77
disav          0.75
prohibitions   0.75
repud          0.73
Activations Density 0.030%