Explanations
the presence of specific formatted code or mathematical expressions
Negative Logits
Token           Logit
Theſe           -1.09
Anſ             -0.96
ſeveral         -0.91
iconFacebook    -0.89
་་              -0.89
myſelf          -0.89
―――――           -0.88
verſ            -0.88
ſever           -0.87
themſelves      -0.86
Positive Logits
Token           Logit
x               1.09
S               0.95
P               0.94
xH              0.93
g               0.92
G               0.88
T               0.88
B               0.88
p               0.87
M               0.87
Activations Density 0.139%
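
The page does not say how the density figure is computed. A common reading is the fraction of sampled token positions on which the feature activates at all. The sketch below assumes that definition. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only to illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 139 of 100,000 token positions.
sample = [1.0] * 139 + [0.0] * (100_000 - 139)
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.139% would mean the feature is active on roughly 1 in 720 sampled tokens, which is sparse.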