INDEX
Explanations
references to individuals or groups and their relationships
Negative Logits
“       -0.29
(“      -0.26
‘       -0.26
’S      -0.24
’ll     -0.24
’re     -0.23
ï       -0.23
’m      -0.22
’ve     -0.22
“[      -0.22
Positive Logits
;s      0.27
's      0.27
'a      0.26
;'      0.25
"'      0.25
'[      0.25
%'      0.22
_'      0.22
()'     0.22
*'      0.22
Activations Density: 0.181%