INDEX
Explanations
phrases indicating correctness or approval
expressions related to the concept of "rightness"
Negative Logits
ĸļ        -0.81
mat       -0.71
ains      -0.69
ipation   -0.68
ulz       -0.65
cit       -0.64
ripp      -0.63
igmat     -0.61
graph     -0.61
Railroad  -0.60

Positive Logits
eous       1.30
wing       0.82
winger     0.80
wing       0.78
shore      0.77
aligned    0.76
ward       0.65
fielder    0.65
move       0.65
å¾         0.64
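The page does not say how the logit lists were produced. A common approach for feature dashboards, assumed here, is to project the feature's decoder direction onto each token's unembedding vector and rank tokens by the resulting dot product. The minimal pure-Python sketch below follows that assumption; the functions `logit_effects` and `top_and_bottom`, the toy 3-dimensional weights, and the four-token vocabulary are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Ranked = List[Tuple[str, float]]


def logit_effects(
    decoder_dir: Sequence[float],
    unembed: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Dot product of a feature's decoder direction with each token's unembedding vector."""
    return {
        tok: sum(d * u for d, u in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }


def top_and_bottom(effects: Dict[str, float], k: int) -> Tuple[Ranked, Ranked]:
    """Return the k most positive and the k most negative tokens, strongest first."""
    positive = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)[:k]
    negative = sorted(effects.items(), key=lambda kv: kv[1])[:k]
    return positive, negative


# Hypothetical toy weights; a real model has thousands of dimensions and tokens.
decoder_dir = [0.5, -0.2, 0.1]
unembed = {
    "wing": [1.0, 0.0, 0.5],
    "shore": [0.8, -0.5, 0.0],
    "graph": [-0.6, 0.4, 0.0],
    "mat": [-0.4, 0.3, -0.2],
}

effects = logit_effects(decoder_dir, unembed)
pos, neg = top_and_bottom(effects, k=2)
print("positive:", pos)
print("negative:", neg)
```

Sorting twice keeps the code simple. For a real vocabulary, a partial selection such as `heapq.nlargest` and `heapq.nsmallest` would avoid fully sorting every token.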
Activations Density: 0.041%
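The page does not define activation density. A typical definition, assumed here, is the percentage of sampled tokens on which the feature's activation is strictly positive. Under that definition, 0.041% corresponds to roughly 41 active tokens per 100,000. The sketch below uses a hypothetical helper, `activation_density`, and a made-up sample to show the arithmetic.

```python
from typing import Iterable


def activation_density(activations: Iterable[float]) -> float:
    """Percentage of tokens on which the feature's activation is strictly positive."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > 0:
            active += 1
    if total == 0:
        raise ValueError("need at least one activation")
    return 100.0 * active / total


# Hypothetical sample: 41 active tokens out of 100,000.
sample = [0.0] * 99_959 + [1.2] * 41
print(f"{activation_density(sample):.3f}%")  # 0.041%
```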