Explanations
comparisons in the context of improvement or deterioration
evaluative language related to improvement and decline
Negative Logits
entirety      -0.77
heid          -0.70
iao           -0.65
hemisphere    -0.64
Lau           -0.63
halves        -0.60
apple         -0.60
holder        -0.59
ellow         -0.59
Hong          -0.58
Positive Logits
sidx          0.86
mileage       0.83
traction      0.83
ãĤ¼           0.82
noticed       0.80
ModLoader     0.79
veter         0.77
retty         0.74
wcs           0.72
puberty       0.71
Activations Density: 0.088%
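
For readers unfamiliar with the density figure, the sketch below shows one common way such a number could be computed: the percentage of sampled token positions on which the feature's activation is above zero. This is an assumption about the metric, not a statement of how this page derived it. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only so the arithmetic lands on 0.088%.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: density is defined as (active positions / total positions) * 100.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Illustrative, made-up data: 88 active positions out of 100,000 sampled tokens.
sample = [0.0] * 99_912 + [0.5] * 88
density = activation_density(sample)
print(f"{density:.3f}%")
```

A density this low would mean the feature fires on roughly one token in every 1,100, which is typical of a sparse, fairly specific feature.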