INDEX
Explanations
terms related to crime and punishment
phrases related to social dynamics and interactions
Negative Logits
Token       Logit
(?,         -0.53
orem        -0.49
.",         -0.49
ukong       -0.45
venants     -0.44
owered      -0.44
yssey       -0.44
bilt        -0.43
arij        -0.43
aturday     -0.43
Positive Logits
Token       Logit
!).         0.58
?).         0.51
).[         0.51
)?          0.50
)—          0.48
phr         0.45
!)          0.44
)[          0.44
-)          0.43
).          0.43
Activations Density: 3.321%