Explanations
references to fairness or equity in various contexts
Negative Logits
Token    Logit
elho     -0.17
pheres   -0.16
owo      -0.15
obia     -0.15
eve      -0.15
ĤŃ       -0.15
endor    -0.14
gia      -0.14
sWith    -0.14
stag     -0.14
Positive Logits
Token    Logit
yt       0.39
ground   0.34
weather  0.33
grounds  0.33
er       0.28
ies      0.27
fax      0.27
child    0.26
play     0.25
banks    0.25
Activations Density: 0.029%
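
As a rough illustration of how a density figure like this could be obtained, the sketch below assumes that activation density means the fraction of token positions on which the feature has a nonzero activation. That definition is an assumption, not something this page states. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical; the counts are invented only so that the result lands on the same percentage.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumed definition: density = (number of active positions) / (total positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, with the feature firing on 29.
acts = [0.0] * 100_000
for i in range(29):
    acts[i * 1000] = 1.0  # made-up activation value at made-up positions

density = activation_density(acts)
print(f"{density:.3%}")
```

A density this low would mean the feature fires on only a few tokens in every ten thousand, which fits a feature tied to a narrow set of contexts.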