Explanations
mentions or references to being prohibited or excluded from something
words related to restrictions or prohibitions
Negative Logits
Token        Logit
downt        -0.73
Writer       -0.71
flyer        -0.67
joints       -0.65
ativity      -0.65
sie          -0.65
messenger    -0.65
drums        -0.64
PROG         -0.62
RIC          -0.62
Positive Logits
Token           Logit
arov            2.30
barred          2.02
allow           1.96
allows          1.69
allowed         1.53
disqual         1.38
imming          1.30
imposed         1.29
disqualified    1.23
imov            1.05
Activations Density: 0.039%