INDEX
Explanations
specific criteria mentioned in text
terms and phrases related to evaluation standards or guidelines
Negative Logits
joy             -0.72
vironment       -0.71
resent          -0.70
orld            -0.69
ership          -0.69
hand            -0.66
lique           -0.66
owners          -0.65
rodu            -0.65
ston            -0.63
Positive Logits
criteria        1.28
erion           1.01
criterion       0.98
witz            0.81
cutoff          0.80
thresholds      0.78
DragonMagazine  0.72
pillar          0.71
ifiers          0.70
idelines        0.70
Activations Density 0.019%