Explanations
references to digital artifacts or technical elements
Negative Logits
Token    Logit
e        -0.36
d        -0.30
ept      -0.22
a        -0.21
ebe      -0.21
eless    -0.20
eel      -0.19
ein      -0.18
eh       -0.17
evice    -0.17
Positive Logits
Token    Logit
8        0.22
0        0.21
9        0.20
5        0.20
7        0.20
6        0.19
3        0.18
4        0.18
2        0.15
00       0.14
Activations Density 0.042%
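
The density figure above can be read as the share of tokens on which the feature fires. The sketch below illustrates that reading; it assumes density means "fraction of tokens with activation above a threshold", which the page itself does not state. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    is active; the dashboard does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, 4 of which activate the feature.
sample = [0.0] * 9996 + [1.3, 0.7, 2.1, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density well below 1%, as reported here, would mean the feature is sparse: it is silent on the vast majority of tokens.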