Explanations
references to legal cases and courtroom-related vocabulary
Negative Logits
Token    Logit
thood    -0.86
Ò        -0.81
bg       -0.74
ceive    -0.74
perse    -0.74
����     -0.73
icia     -0.72
leeve    -0.71
imi      -0.70
aba      -0.70
Positive Logits
Token        Logit
oret         1.50
latter       1.35
biggest      1.27
resa         1.25
remainder    1.21
slightest    1.19
odore        1.18
majority     1.17
simplest     1.15
vast         1.15
Activations Density: 3.765%