Explanations
names of actors or entertainment industry professionals
phrases that introduce examples or lists
Negative Logits
gans       -0.85
istical    -0.82
oric       -0.72
reatment   -0.72
orship     -0.72
enser      -0.71
essing     -0.70
rison      -0.70
idate      -0.69
ivities    -0.69
Positive Logits
Alfred      0.75
Cowboy      0.73
Jasper      0.73
Esper       0.73
Beautiful   0.73
Brig        0.72
Exodus      0.72
Bald        0.72
Martha      0.71
Jeremiah    0.71
Activations Density 0.153%