Explanations
mentions of formal language or classifications in descriptions
Negative Logits
Token      Logit
Lucius     -0.56
gu         -0.46
biling     -0.46
Tiberius   -0.46
äu         -0.45
fluo       -0.45
relais     -0.43
Sully      -0.42
ıyors      -0.42
Giovanna   -0.42
Positive Logits
Token            Logit
nor              1.15
而是             0.92
nor              0.87
melainkan        0.87
CreateTagHelper  0.82
むしろ           0.82
sondern          0.77
بلکه             0.75
Nor              0.72
tampoco          0.70
Activations Density: 2.935%