Explanations
References to notable historical figures and to terms tied to cultural and social contexts
Negative Logits
Token    Logit
y        -0.28
sh       -0.26
sc       -0.26
sm       -0.24
sWith    -0.24
sp       -0.24
sid      -0.24
set      -0.23
sel      -0.23
sr       -0.23
Positive Logits
Token    Logit
er       0.27
cury     0.26
ød       0.25
ë§ģ      0.24
idge     0.24
lain     0.24
hyth     0.23
theless  0.23
erer     0.22
most     0.22
Activations Density: 0.887%