INDEX
Explanations
references to danger and medical conditions involving health risks
Negative Logits
Token          Logit
*"             -0.78
AppComponent   -0.74
#"             -0.74
。"             -0.71
).[            -0.71
."             -0.70
[@             -0.70
)."            -0.69
["             -0.69
."             -0.69
Positive Logits
Token          Logit
!              1.16
?              1.07
!</            0.83
!-             0.80
!?             0.79
?!             0.79
!');           0.77
?-             0.76
-!             0.75
?</            0.73
Activations Density 0.022%
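
To give a rough sense of scale for the density figure, the sketch below converts it into an expected count of activating tokens. It assumes, and the page does not confirm this, that "Activations Density" means the percentage of sampled tokens on which the feature has a nonzero activation. The function name `expected_activations` and the sample size of one million tokens are hypothetical, chosen only for illustration.

```python
def expected_activations(density_percent: float, n_tokens: int) -> float:
    """Expected number of tokens on which the feature fires.

    Assumption: density_percent is the share of tokens (in percent)
    with a nonzero activation for this feature.
    """
    return density_percent / 100.0 * n_tokens


# Hypothetical sample of one million tokens at the reported 0.022% density.
count = expected_activations(0.022, 1_000_000)
print(f"about {count:.0f} activating tokens per million")
```

Under that reading, the feature is sparse: it would fire on roughly two tokens in every ten thousand.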