INDEX
Explanations
names related to politics and academia
mentions of specific names and the word "socks."
Negative Logits
thal      -0.87
med       -0.82
joints    -0.78
zon       -0.71
lda       -0.68
headed    -0.67
pton      -0.67
joint     -0.66
gary      -0.65
ibur      -0.64
Positive Logits
imental   0.91
ivities   0.85
ipation   0.82
orship    0.79
Davies    0.78
ilon      0.76
ieri      0.75
iar       0.73
atsuki    0.72
rolet     0.72
Activations Density: 0.039%