Explanations
words related to specific groups or categories, such as professions, demographics, or ideologies
references to specific groups or categories of people and their roles in various contexts
Negative Logits
scrimmage   -0.66
ulative     -0.59
Held        -0.59
iasis       -0.58
0004        -0.56
Carbuncle   -0.54
rift        -0.54
oward       -0.54
umn         -0.54
ieves       -0.53
Positive Logits
itself      1.16
ones        1.11
themselves  0.98
)</         0.89
!).         0.87
thereof     0.83
himself     0.83
herself     0.81
).[         0.77
).          0.76
Activations Density: 0.570%