INDEX
Explanations
references to historical injustices and their societal impact
Negative Logits
æ´¥       -0.17
sm        -0.14
Cro       -0.14
cro       -0.14
bir       -0.14
danger    -0.14
unbind    -0.14
vice      -0.13
еÑĢж      -0.13
ζί        -0.13
Positive Logits
iesen     0.18
aryl      0.15
oodle     0.14
cek       0.14
Ïįν       0.14
OUTH      0.14
áºŃu      0.13
istring   0.13
flation   0.13
raries    0.13
Activations Density: 0.333%