INDEX
Explanations
programming-related keywords that indicate variables, conditions, and comparisons
Negative Logits
';
-0.95
';
-0.95
`;
-0.95
`;
-0.90
’;
-0.89
";
-0.89
'];
-0.88
"));
-0.88
”;
-0.86
'));
-0.85
Positive Logits
"){
1.12
&&
1.05
'){
0.95
){
0.95
"){
0.89
!")
0.87
")
0.86
%")
0.83
||
0.83
()){
0.83
Activations Density 0.152%