INDEX
Explanations
specific phrases or words related to being correct, effective, or appropriate
Negative Logits
ĸļ        -0.83
ADRA      -0.72
anned     -0.71
cit       -0.70
ruary     -0.69
ushima    -0.67
ivism     -0.65
bery      -0.65
Lilly     -0.64
oute      -0.63
Positive Logits
amount           1.11
combination      0.92
balance          0.87
thing            0.84
sized            0.83
antidote         0.82
attitude         0.82
kind             0.81
circumstances    0.81
temperament      0.81
Activations Density 0.506%