Explanations
privileged people and wealth
Negative Logits
xhr           -0.89
ularis        -0.82
tubuh         -0.79
greenrobot    -0.76
maglia        -0.76
Ԁ             -0.76
maestro       -0.75
olor          -0.75
stronger      -0.75
掖            -0.75
Positive Logits
privileged    1.80
privilege     1.62
Privilege     1.59
sno           1.55
privileged    1.54
Privilege     1.54
Privile       1.53
elite         1.53
Privile       1.53
wealthy       1.52
Activations Density: 0.062%