INDEX
Explanations
references to formal recognition events like awards or ceremonies
Negative Logits
ChildScrollView    -0.84
$_"                -0.80
Majefty            -0.80
ArrowToggle        -0.80
Cuthbert           -0.78
Italijani          -0.78
σθαι               -0.77
MessageState       -0.76
समीक्षक            -0.75
Italijanski        -0.75
Positive Logits
↵                  1.36
↵↵                 1.28
↵↵↵↵↵              1.09
↵↵↵                1.08
↵↵↵↵               1.02
</tr>              1.02
↵↵↵↵↵↵             1.00
↵↵↵↵↵↵↵↵           0.90
↵↵↵↵↵↵↵            0.90
[toxicity=0]       0.88
Activations Density 0.039%