Explanations
mentions of religious leaders or figures
Negative Logits
bang       -0.83
skirts     -0.74
hops       -0.70
uring      -0.69
visible    -0.68
efully     -0.68
cham       -0.66
mable      -0.65
bender     -0.65
rek        -0.64
Positive Logits
inals        0.97
ity          0.94
itatively    0.80
ITY          0.79
isks         0.78
INAL         0.77
esian        0.76
Cardinals    0.75
Newman       0.73
Fitzgerald   0.72
Activations Density: 0.034%