Explanations
phrases related to actions and their moral implications
Negative Logits
atter     -0.20
assa      -0.17
boz       -0.15
çħ        -0.15
erea      -0.15
Ã¥l       -0.15
еÑĤе      -0.15
)((((     -0.15
ä½³       -0.14
ÑĢÑĮ      -0.14
Positive Logits
steps         0.16
performed     0.16
нам           0.15
Performed     0.15
Gron          0.15
hma           0.14
COPE          0.14
nyder         0.14
activities    0.14
tasks         0.14
Activations Density 0.375%