INDEX
Explanations
specific phrases or constructs related to conditional or regulatory language
Negative Logits
myſelf      -1.15
pleaſure    -0.97
ſelf        -0.96
whoſe       -0.94
Houſe       -0.93
Majefty     -0.93
Theſe       -0.92
―――――       -0.92
itſelf      -0.91
Monfieur    -0.91
Positive Logits
<bos>       2.45
'           1.20
            1.10
’           1.07
"           0.98
)           0.92
↵           0.90
",          0.89
',          0.87
”           0.87
Activations Density 2.648%
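
The page does not say how the density figure is defined. A common reading is the fraction of sampled token positions on which the feature has a nonzero activation, and the sketch below assumes that reading. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data, made up for illustration: 1000 token positions, 26 of them active.
sample = [0.0] * 974 + [0.5] * 26
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 2.648% would mean the feature is active on roughly 26 of every 1000 sampled tokens.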