Explanations
phrases that denote preferences or affiliations with individuals or groups
Negative Logits
theless       -0.81
cffff         -0.76
discard       -0.76
disse         -0.74
ingest        -0.70
elig          -0.69
rent          -0.68
cellaneous    -0.67
farm          -0.67
scaven        -0.66
Positive Logits
��            0.93
Vive          0.89
Gears         0.78
Yor           0.73
Arnold        0.73
Irving        0.73
Tesla         0.72
Musk          0.72
Eliot         0.72
Shake         0.71
Activations Density: 0.061%
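
The density figure can be read as the fraction of token positions on which the feature fires. The sketch below illustrates that arithmetic under stated assumptions: that "fires" means an activation strictly above zero, and that the sample is a flat list of per-token activations. The function name activation_density and the sample data are hypothetical and are not taken from this page or from any tool that produced it.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a position counts as active when its value is strictly
    greater than `threshold` (zero by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 61 of which are active.
sample = [0.0] * 99_939 + [1.0] * 61
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.061%
```

A density this low means the feature is sparse: on the hypothetical sample it is active on roughly 6 of every 10,000 tokens.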