INDEX
Explanations
phrases describing capabilities or skills
references to capability or potential
Negative Logits
Surv       -0.67
1924       -0.66
1926       -0.65
Benz       -0.64
Observ     -0.64
gar        -0.63
Traffic    -0.63
Deer       -0.63
Soda       -0.63
Clover     -0.62
Positive Logits
ibility     0.92
auga        0.92
ability     0.91
Ability     0.89
ibilities   0.88
destro      0.87
incap       0.84
anooga      0.84
reys        0.82
Reviewer    0.80
Activations Density: 0.032%