INDEX
Explanations
expressions of gratitude and appreciation
Negative Logits
_refl        -0.15
raw          -0.13
uko          -0.13
tent         -0.13
.gov         -0.13
ativ         -0.13
akan         -0.13
履           -0.13
ark          -0.13
log          -0.13
Positive Logits
privileged   0.25
privilege    0.22
opportunity  0.21
privileged   0.21
able         0.19
_timing      0.18
fortunate    0.18
enough       0.17
privileges   0.17
priv         0.17
Activations Density 0.037%