INDEX
Explanations
instances where permission or prohibition is discussed
instances of the word "allow" and its variations
Negative Logits
enegger     -0.68
Soldier     -0.68
nard        -0.66
star        -0.65
borough     -0.64
bard        -0.64
athan       -0.63
kind        -0.62
figure      -0.62
kaya        -0.61
Positive Logits
Reviewer    0.91
us          0.71
ipient      0.71
opol        0.71
auga        0.70
exemptions  0.69
rapists     0.67
disclaim    0.66
Ĭ±          0.65
exceptions  0.65
Activations Density 0.043%
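
The density figure above can be read as the share of token positions on which the feature fires. The sketch below illustrates that reading, assuming "active" means an activation strictly above a threshold of zero; the dashboard's exact definition is not stated here. The function name `activation_density`, the threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a position counts as active when its activation is strictly
    greater than `threshold`. The dashboard may define this differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 4 of which are active.
acts = [0.0] * 10_000
for i in (12, 512, 4096, 9000):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.040%
```

Under this reading, a density of 0.043% would correspond to roughly 43 active positions per 100,000 tokens.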