INDEX
Explanations
expressions of moral judgment regarding actions and societal norms
Negative Logits
AssemblyCulture     -0.79
IntoConstraints     -0.78
parsedMessage       -0.75
متعلقه              -0.75
webElementXpaths    -0.74
ValueStyle          -0.73
utafitiHapana       -0.72
ódó                 -0.71
хьтан               -0.69
виправивши          -0.68
Positive Logits
people              0.79
sometimes           0.66
stereotypes         0.65
stereotype          0.64
people              0.60
ignorant            0.59
有些人              0.59
often               0.59
misunderstand       0.57
Often               0.57
Activations Density: 0.799%