Explanations
DeepMind, Bialik, hostilis, Lewy
Negative Logits
adiab    0.45
applic   0.44
implic   0.43
élim     0.43
trase    0.43
mêmes    0.42
spawned  0.42
vivi     0.41
anni     0.41
inti     0.41
Positive Logits
illon    0.50
on       0.47
arr      0.47
ano      0.46
iles     0.46
art      0.46
ous      0.45
arn      0.44
ana      0.44
ford     0.44
Activations Density 0.051%