INDEX
Explanations
negations and phrases indicating exclusion or absence
Negative Logits
Token             Logit
__':              -0.88
colgroup          -0.82
__":              -0.80
__':              -0.71
LabelTagHelper    -0.71
hals              -0.71
iastes            -0.70
gründung          -0.69
__":              -0.68
=$?               -0.68
Positive Logits
Token             Logit
G                 0.58
Kitch             0.57
dro               0.56
I                 0.56
d                 0.55
desple            0.54
y                 0.52
Pog               0.51
p                 0.51
check             0.50
Activations Density: 0.002%