INDEX
Explanations
references to researchers or authors, particularly their last names
Negative Logits
aise     -0.22
ound     -0.20
ates     -0.19
ange     -0.19
ain      -0.18
andom    -0.18
ади      -0.17
oller    -0.17
ate      -0.16
anch     -0.16
Positive Logits
ios      0.20
nj       0.18
icket    0.17
ynes     0.16
asan     0.16
ar       0.15
imens    0.15
alley    0.14
ζα       0.14
il       0.14
Activations Density 0.020%