INDEX
Explanations
phrases related to specific tasks or steps within instructions
NEGATIVE LOGITS
iola       -0.80
uay        -0.77
hap        -0.76
ubb        -0.76
ulz        -0.75
ordable    -0.73
anta       -0.73
leased     -0.71
uga        -0.68
ificial    -0.67
POSITIVE LOGITS
ratio          1.28
ratios         1.01
trope          0.96
initiative     0.95
scenario       0.94
Ratio          0.94
clause         0.93
mantra         0.91
distinction    0.91
mentality      0.89
Activations Density 0.550%