INDEX
Explanations
references to medical or health-related topics, particularly around misinformation and procedures
Negative Logits
âĪĴ        -0.31
-↵↵       -0.20
ãĢľ        -0.19
—↵↵       -0.19
âĪĴ        -0.18
Ù¬         -0.18
ï¼į        -0.18
ðŁĻĤ↵↵    -0.17
”↵↵       -0.17
————      -0.17
Positive Logits
--         0.95
--↵        0.67
--↵↵       0.55
--         0.52
"--        0.50
'--        0.50
[--        0.50
.--        0.47
/--        0.46
(--        0.46
Activations Density 0.086%