Explanations
references to literary figures or works
Negative Logits
thouse   -0.17
anoia    -0.16
tek      -0.16
gage     -0.15
onium    -0.14
949      -0.14
UCLA     -0.14
iglia    -0.14
strar    -0.13
uttle    -0.13
Positive Logits
Hem      0.35
hem      0.25
hem      0.20
bull     0.18
Ernest   0.18
Stein    0.18
çĢ       0.17
Cub      0.16
suck     0.16
pic      0.15
Activations Density 0.005%