INDEX
Explanations
statements expressing differing levels of correctness or morality, along with recommendations or judgments about actions
phrases that express mistakes or wrongdoings related to moral or ethical judgments
Negative Logits
strengths      -0.70
lator          -0.68
reperto        -0.67
aukee          -0.63
calmed         -0.63
linem          -0.63
rapport        -0.62
hani           -0.62
uli            -0.61
delightful     -0.61
Positive Logits
underestimate   0.88
knowingly       0.83
anymore         0.81
presume         0.80
oppose          0.79
condone         0.78
whatsoever      0.77
anyone          0.75
accuse          0.75
impose          0.75
Activations Density 0.207%