Explanations
references to structures or frameworks that could signify oppression or confinement
Negative Logits
adays     -0.17
owitz     -0.16
[s        -0.16
razier    -0.16
weeney    -0.15
ettes     -0.14
ziej      -0.14
wayne     -0.14
worthy    -0.14
(s        -0.14
Positive Logits
une       0.19
ild       0.18
els       0.18
ints      0.17
ils       0.17
ads       0.17
unc       0.17
ips       0.17
iter      0.16
icer      0.16
Activations Density 0.007%