Explanations
words related to taking actions, particularly in a regulatory or enforcement context
Negative Logits
oran      -0.16
šov       -0.16
972       -0.14
vard      -0.14
orning    -0.14
603       -0.14
617       -0.14
605       -0.14
kalk      -0.14
strap     -0.14
Positive Logits
steps          0.33
concrete       0.24
Steps          0.24
firm           0.23
measures       0.23
steps          0.23
Steps          0.22
appropriate    0.22
necessary      0.21
fir            0.21
Activations Density: 0.036%