Explanations
references to roles and responsibilities in a professional setting
Negative Logits
(       -0.23
‘       -0.21
(«      -0.21
("      -0.20
(`      -0.20
[]"     -0.19
"[      -0.18
."[     -0.18
—       -0.18
(!      -0.18
Positive Logits
--↵     0.38
-↵      0.33
uh      0.26
kind    0.24
sort    0.24
--↵↵    0.24
--,     0.23
--      0.22
um      0.22
--↵     0.22
Activations Density 0.674%
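
For readers unfamiliar with the density figure, the following is a minimal sketch of how such a number is commonly computed, assuming it means the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions for illustration and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 7 of which activate the feature.
sample = [0.0] * 993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 5.0]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.700%
```

Under this reading, a density of 0.674% would mean the feature fires on roughly 7 of every 1,000 tokens sampled.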