INDEX
Explanations
references to fictional works or content
Negative Logits
!)           -0.61
!】           -0.58
ftagPool     -0.57
?】           -0.57
RuleContext  -0.57
$)           -0.56
occuper      -0.55
%]           -0.55
_)           -0.55
|]           -0.54
Positive Logits
".           1.16
"            1.09
”            1.06
”.           1.01
"!           1.00
""           0.99
"\\          0.95
”!           0.92
"</          0.92
",           0.90
Activations Density: 0.485%