INDEX
Explanations
references to the concept of privilege
references to concepts of privilege
Negative Logits
Token         Logit
ood           -0.83
ergy          -0.71
Interstitial  -0.70
ETA           -0.69
GH            -0.67
eros          -0.64
thumbnails    -0.64
tra           -0.63
Natural       -0.62
ail           -0.61
Positive Logits
Token         Logit
ilege         1.70
privilege     1.66
ileged        1.18
privileges    1.06
privileged    0.89
itism         0.89
Priv          0.85
privile       0.80
imore         0.78
afforded      0.77
Activations Density: 0.007%