INDEX
Explanations
references to burning and fire-related concepts
Negative Logits
Token     Logit
Fauc      -0.17
531       -0.17
strang    -0.16
ilm       -0.16
akt       -0.15
215       -0.15
ing       -0.15
aden      -0.15
d         -0.15
ve        -0.15
Positive Logits
Token     Logit
alive     0.28
доÑĤ      0.26
ished     0.25
á»ijt      0.22
-toast    0.22
Alive     0.22
ISHED     0.21
alive     0.20
ðŁĶ       0.20
ishing    0.20
Activations Density 0.037%