INDEX
Explanations
names, initials, or abbreviations
Negative Logits
-      0.81
T      0.66
N      0.66
1      0.63
3      0.62
S      0.59
V      0.59
2      0.58
-      0.58
/      0.57
Positive Logits
.—     0.61
.–     0.59
।,     0.59
.~\    0.56
.;     0.55
.".,   0.55
.,     0.54
.{     0.54
.--    0.54
.,"    0.53
Activations Density 0.011%