Explanations
Words related to categorization and classification, particularly in social or systematic contexts
Negative Logits
Token       Logit
ors         -0.69
orate       -0.27
ions        -0.27
es          -0.26
or          -0.26
ion         -0.26
orial       -0.25
ed          -0.23
ori         -0.23
aar         -0.23
Positive Logits
Token       Logit
tempt       0.28
rice        0.27
te          0.25
he          0.25
trib        0.24
rices       0.21
ivity       0.20
tributes    0.20
tempts      0.20
ricks       0.19
Activations Density: 0.112%