INDEX
Explanations
phrases related to various aspects of society, such as culture, reality, work, and beauty
terms associated with abstract concepts and societal structures
Negative Logits
ificantly    -0.78
Important    -0.66
volent       -0.65
orthy        -0.63
untarily     -0.61
orously      -0.61
Important    -0.60
regulated    -0.60
noxious      -0.59
isoft        -0.59
Positive Logits
ounters      0.82
antry        0.81
confines     0.73
afforded     0.72
mund         0.71
iences       0.65
smanship     0.64
forts        0.63
eers         0.63
of           0.63
Activations Density: 0.493%