INDEX
Explanations
names of various people
proper names, particularly those related to individuals and organizations
Negative Logits
Perception   -0.76
Italians     -0.61
Ryder        -0.61
Brav         -0.60
Slovenia     -0.58
unity        -0.57
Wolves       -0.56
Citiz        -0.56
demoral      -0.56
Colossus     -0.56
Positive Logits
Jr           0.97
nr           0.93
vard         0.82
Sr           0.82
aka          0.82
ensen        0.81
III          0.74
iman         0.74
itars        0.74
velt         0.71
Activations Density: 0.335%