Explanations
expressions of compassion or altruism
Negative Logits
,’”    -0.31
,”     -0.30
,’     -0.28
“      -0.26
“[     -0.25
,’’    -0.25
=”     -0.25
.”     -0.24
”      -0.24
,“     -0.23
Positive Logits
"      0.58
'      0.52
's     0.50
'll    0.48
've    0.47
're    0.46
'm     0.44
'd     0.43
("     0.42
't     0.40
Activations Density 3.023%
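
The density figure above is not defined on this page. One common reading, assumed here, is the fraction of token positions on which the feature's activation is nonzero. The sketch below shows that calculation under this assumption. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    feature fires. The dashboard itself does not state its definition.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical activations for eight token positions (illustrative only).
sample = [0.0, 0.0, 1.2, 0.0, 0.4, 0.0, 0.0, 0.0]
print(f"{activation_density(sample):.3%}")  # prints 25.000%
```

Under this reading, a density of 3.023% would mean the feature fires on roughly 3 of every 100 sampled token positions.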