Explanations
No Explanations Found
Negative Logits
Token       Logit
Person      -0.79
20439       -0.77
person      -0.76
mal         -0.73
people      -0.73
Nob         -0.72
"$:/        -0.71
woman       -0.70
hor         -0.69
Michaels    -0.66
Positive Logits
Token       Logit
shut        0.95
wagen       0.79
icago       0.77
oaded       0.70
enaries     0.70
ornia       0.69
ttle        0.69
ifax        0.66
nr          0.66
ysis        0.65
Activations Density: 0.000%
No Known Activations
This feature has no known activations.