INDEX
Explanations
discussions about the treatment of individuals, particularly in relation to equality and respect across different contexts
Negative Logits
defaultstate    -0.41
優れた          -0.38
[*]             -0.36
effective       -0.36
readily         -0.35
tarvit          -0.35
чудо            -0.35
expérimentés    -0.35
ⓧ               -0.35
valid           -0.34
Positive Logits
differently     1.39
correctly       0.82
accordingly     0.82
incorrectly     0.80
correctly       0.74
diffé           0.73
differ          0.73
similarly       0.71
Differ          0.71
autrement       0.66
Activations Density 0.582%