Explanations
phrases related to ability or incapability
negations and assertions of ability
Negative Logits
Token          Logit
Lens           -0.64
Rutherford     -0.62
enthusi        -0.62
Brent          -0.60
IDES           -0.57
millenn        -0.56
ELD            -0.56
prompts        -0.55
conspicuous    -0.55
Mant           -0.55
Positive Logits
Token          Logit
't             2.33
NOT            1.32
adian          1.17
afford         1.14
na             1.06
hardly         1.05
´              1.00
berra          0.99
ny             0.99
handle         0.96
Activations Density 0.143%
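
The page does not say how the density figure is computed. A common reading, assumed here, is the fraction of token positions on which the feature's activation is above zero. The sketch below illustrates that reading on made-up data; the function name `activation_density`, the `threshold` parameter, and the toy activation list are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which
    the feature fires. The source page does not confirm this definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): the feature fires on 7 of 5000 token positions.
toy_activations = [0.0] * 4993 + [1.5] * 7
density = activation_density(toy_activations)

# Format as a percentage with three decimals, matching the dashboard's style.
print(f"{density:.3%}")
```

Under this reading, a density of 0.143% would mean the feature is active on roughly 1 in 700 token positions, which is consistent with a sparse feature tied to a narrow set of phrases.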