INDEX
Explanations
explicit mentions of rules or restrictions being imposed on specific behaviors, actions, or groups
phrases related to permissions and prohibitions
Negative Logits
Soldier       -0.70
xon           -0.67
lves          -0.65
lust          -0.62
posure        -0.62
borough       -0.61
athan         -0.61
Generation    -0.61
center        -0.60
sis           -0.59
Positive Logits
Reviewer      1.09
uthor         0.87
exemptions    0.79
ommod         0.76
ľ             0.74
allowed       0.72
disclaim      0.70
ptin          0.70
permitted     0.70
ravel         0.69
Activations Density 0.038%