INDEX
Explanations
identifiers followed by underscore
Negative Logits
socalled
0.50
<unused395>
0.43
(‘
0.42
ോദ
0.41
!」
0.40
𒇻
0.38
搂
0.37
റ്റ്
0.37
(«
0.37
outrageous
0.36
Positive Logits
_
1.12
-
0.90
・
0.79
\_
0.76
_{
0.59
_${
0.49
‐
0.46
־
0.46
0.46
-_-
0.44
Activations Density 0.043%