INDEX
Explanations
references to different technical measurements or metrics
Negative Logits
...↵↵
-0.23
......↵↵
-0.20
......
-0.19
...",
-0.19
....↵↵
-0.18
.....↵↵
-0.17
..."↵↵
-0.17
,...↵↵
-0.17
...↵↵↵
-0.16
...',
-0.16
Positive Logits
.*↵
0.26
,*
0.26
.*
0.25
*↵
0.25
.*
0.23
*
0.23
†
0.22
**↵
0.21
‡
0.21
*↵
0.20
Activations Density 0.002%