INDEX
Explanations
phrases indicating seriousness or significant concern
Negative Logits
ugs      -0.17
inely    -0.16
olars    -0.16
oca      -0.15
ERRU     -0.15
abh      -0.15
esa      -0.14
ále      -0.14
avour    -0.14
ÏĮ       -0.14
Positive Logits
likelihood     0.31
probability    0.29
honesty        0.25
intents        0.23
practical      0.23
honestly       0.23
cand           0.22
fairness       0.22
odds           0.21
accounts       0.21
Activations Density 0.036%