INDEX
Explanations
words related to permission or prohibition
phrases related to permission or restrictions on actions
Negative Logits
leaf          -0.71
lves          -0.70
fish          -0.69
atl           -0.68
xon           -0.68
oslav         -0.63
tal           -0.62
Generation    -0.62
blow          -0.60
bush          -0.59
Positive Logits
Reviewer      1.01
exemptions    0.79
plur          0.73
deviations    0.72
crawl         0.71
uthor         0.68
usa           0.67
pez           0.66
downtime      0.66
pedia         0.66
Activations Density: 0.045%