Explanations
phrases indicating exceptions or qualifications in general statements
Negative Logits
رشف  -0.69
(token not shown)  -0.65
Халык  -0.64
lenker  -0.62
AsUp  -0.62
'\\;'  -0.62
twimg  -0.61
ligiloj  -0.60
BoxFit  -0.59
сылкі  -0.58
Positive Logits
necessarily  0.55
Alike  0.52
lắm  0.49
znacz  0.48
?}",  0.47
šť  0.47
perfetta  0.45
isolato  0.45
andidaten  0.44
końca  0.44
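A minimal sketch of how token lists like the two above can be produced, assuming (as on typical sparse-autoencoder dashboards) that a feature's score for each vocabulary token is the dot product of the feature's decoder direction with that token's unembedding column. The functions `token_logit_effects` and `top_logits` and the toy matrices are hypothetical illustrations, not this dashboard's actual code.

```python
# Hypothetical sketch: score each vocabulary token by projecting the feature's
# decoder direction through the unembedding, then take the extremes.

def token_logit_effects(decoder_dir, unembed, vocab):
    """Return {token: score}; unembed[i][j] is row i (d_model), column j (vocab)."""
    d_model = len(decoder_dir)
    scores = {}
    for j, token in enumerate(vocab):
        # Dot product of the decoder direction with token j's unembedding column.
        scores[token] = sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
    return scores


def top_logits(scores, k):
    """Return (k most negative, k most positive) (token, score) pairs."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


if __name__ == "__main__":
    # Toy example: a 2-dimensional model with a 3-token vocabulary.
    decoder_dir = [1.0, -0.5]
    unembed = [[0.2, -0.4, 0.6],
               [0.4, 0.2, -0.2]]
    scores = token_logit_effects(decoder_dir, unembed, ["a", "b", "c"])
    print(top_logits(scores, k=1))
```

In the toy example, token "b" is pushed down hardest and token "c" is promoted most, mirroring the negative and positive lists above.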
Activation Density: 0.299%
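A minimal sketch of what a density figure like this measures, assuming it is the percentage of token positions on which the feature's activation is above zero. Under that assumption, 0.299% corresponds to roughly 3 firing positions per 1,000 tokens. The function name `activation_density` and the `threshold` parameter are hypothetical.

```python
# Hypothetical sketch: activation density as the share of token positions
# where the feature fires (activation strictly above a threshold, default 0).

def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # 3 firing positions out of 1,000 gives a density of about 0.3%.
    acts = [0.0] * 997 + [1.2, 0.5, 3.0]
    print(activation_density(acts))
```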