INDEX
Explanations
phrases related to the philosophical or moral implications of actions
Negative Logits
addCriterion  -0.18
ầy            -0.16
alg           -0.15
beros         -0.15
ovsky         -0.15
ÙĦÙ쨩        -0.14
iat           -0.14
æĢĿãģĦ        -0.14
abr           -0.13
Yar           -0.13
Positive Logits
Factory  0.17
Factory  0.15
/or      0.14
Giles    0.14
factory  0.14
irth     0.14
rics     0.14
Phelps   0.13
Aviv     0.13
factory  0.13
Activations Density: 1.023%