INDEX
Explanations
proper nouns, specifically names of individuals or places
mentions of religious figures, specifically the Pope
Negative Logits
milo        -0.87
eland       -0.71
Queue       -0.70
court       -0.68
lawy        -0.64
doors       -0.64
grave       -0.63
Nib         -0.63
Disk        -0.63
committee   -0.63
Positive Logits
choes       0.65
20439       0.64
ONG         0.64
ichick      0.62
hin         0.62
ishy        0.61
iami        0.61
pecially    0.59
pes         0.59
zik         0.59
Activations Density 0.000%