INDEX
Explanations
proper nouns related to literature and media
references to notable individuals and cultural works
Negative Logits
hips      -0.84
cheat     -0.77
ees       -0.76
arnaev    -0.70
lies      -0.70
Domin     -0.70
irmed     -0.69
anova     -0.69
iege      -0.68
een       -0.67
Positive Logits
Simpson   0.87
ufact     0.79
pitched   0.72
Homer     0.70
Kurd      0.69
Gohan     0.69
ophon     0.67
Hes       0.66
icably    0.63
ocrates   0.60
Activations Density 0.008%