INDEX
Explanations
questions or phrases that inquire about types or categories of things
Negative Logits
ATIONS      -0.69
uble        -0.67
EF          -0.65
ELL         -0.65
Ess         -0.61
Drift       -0.59
Et          -0.57
itations    -0.57
THREE       -0.57
Prelude     -0.55
Positive Logits
of          0.98
of          0.93
luster      0.80
nesses      0.72
achu        0.69
icles       0.68
thereof     0.67
OF          0.64
Of          0.64
oft         0.64
Activations Density 0.032%
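
As a rough illustration of what a density figure like 0.032% could mean, the sketch below assumes that activation density is the fraction of sampled token positions on which the feature's activation is above zero. That definition, the function name activation_density, and the sample data are assumptions made for illustration; they are not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all; the source page does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 32 of 100,000 token positions.
acts = [0.0] * 100_000
for i in range(32):
    acts[i * 3000] = 1.5  # arbitrary nonzero activation value

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.032%
```

Under that assumption, a density of 0.032% corresponds to roughly one active token in every 3,000 sampled.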