INDEX
Explanations
references to identities and names
Negative Logits
HER          -0.85
710          -0.83
oka          -0.82
570          -0.80
ibilities    -0.79
615          -0.78
540          -0.77
420          -0.76
ohm          -0.75
550          -0.75
Positive Logits
de           1.08
de           1.07
De           0.99
des          0.94
des          0.90
De           0.88
Des          0.87
DE           0.86
DE           0.82
Dele         0.82
Activations Density 0.042%