INDEX
Explanations
words associated with correctness or justification
terms related to fairness and appropriateness in various contexts
Negative Logits
clos          -0.66
iments        -0.66
Alam          -0.65
dolls         -0.65
ession        -0.63
Volunteers    -0.63
shirts        -0.63
ema           -0.62
fertility     -0.60
Football      -0.60
Positive Logits
ãĤ©           0.96
rightfully    0.95
deserved      0.92
rightly       0.90
é¾į           0.85
è¯            0.78
outweigh      0.77
ãĥ£           0.75
æĺ¯           0.74
eous          0.73
Activations Density 0.013%
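
The density figure above can be read as the share of sampled tokens on which the feature fires. The page does not state how it is computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means (active tokens) / (all sampled tokens).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 13 of 100,000 tokens.
sample = [1.5] * 13 + [0.0] * 99_987
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.013%
```

Under this reading, a density of 0.013% would mean the feature is active on roughly 13 tokens out of every 100,000 sampled.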