INDEX
Explanations
names or proper nouns related to political and community figures
instances of proper nouns and specific names or titles
Negative Logits
REDACTED    -0.65
negro       -0.57
QC          -0.54
:-          -0.50
FF          -0.50
myster      -0.50
ç«          -0.50
©           -0.49
moot        -0.49
manifold    -0.48
Positive Logits
apologized  0.63
Enlarge     0.58
cohol       0.55
rouse       0.54
awaru       0.54
packing     0.52
"}],"       0.52
apologize   0.51
arnaev      0.51
Healthy     0.51
Activations Density 1.078%
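
The density figure above is commonly read as the fraction of sampled tokens on which the feature has a nonzero activation. The page does not say how it was computed, so the following is only a minimal sketch under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of tokens on which the
    feature fires at all. The source page does not confirm this.
    """
    activations = list(activations)
    if not activations:
        # No tokens sampled: report zero rather than dividing by zero.
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations for one feature (not real data).
sample = [0.0, 0.0, 1.3, 0.0, 0.2, 0.0, 0.0, 0.0]
print(f"{activation_density(sample):.3%}")  # 2 of 8 tokens active
```

Under this reading, a density of 1.078% would mean the feature fires on roughly one token in every 93 sampled.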