INDEX
Explanations
mentions of wolves
references to wolves
NEGATIVE LOGITS
angan     -0.76
NAS       -0.73
arters    -0.72
eways     -0.71
claimed   -0.71
ursed     -0.71
arta      -0.70
office    -0.70
ritic     -0.69
unal      -0.68
POSITIVE LOGITS
wolves     1.41
wolf       1.19
wolves     1.11
Wolves     1.07
Fenrir     1.01
wolf       0.95
hound      0.94
gang       0.93
Wolf       0.90
enstein    0.90
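Ranked lists like the positive and negative logits above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and sorting the resulting per-token scores. The sketch below assumes that method; it is not confirmed as how this dashboard computes its lists. The function name `top_logit_tokens`, the toy matrix `W_U`, the decoder vector and the four-token vocabulary are all hypothetical illustration values, not data from this page.

```python
import numpy as np


def top_logit_tokens(decoder_dir, W_U, vocab, k=3):
    """Rank vocabulary tokens by a feature's direct effect on the logits.

    Hypothetical sketch: assumes the logit lists come from projecting the
    feature's decoder direction (shape: d_model) onto the unembedding
    matrix W_U (shape: d_model x vocab_size).
    """
    # One score per vocabulary token, shape (vocab_size,)
    scores = decoder_dir @ W_U
    # Indices sorted from most negative to most positive score
    order = np.argsort(scores)
    negative = [(vocab[i], float(scores[i])) for i in order[:k]]
    positive = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    return positive, negative


# Toy example: d_model = 2, vocabulary of 4 tokens (illustrative values only)
vocab = ["wolf", "hound", "office", "gang"]
decoder_dir = np.array([1.0, 0.0])
W_U = np.array([[1.2, 0.9, -0.7, 0.5],
                [0.3, 0.1, 0.2, -0.4]])

positive, negative = top_logit_tokens(decoder_dir, W_U, vocab, k=2)
print("positive:", positive)
print("negative:", negative)
```

With `k=2` the toy run ranks "wolf" and "hound" as the top positive tokens and "office" as the most negative one. On a real model the vocabulary is tens of thousands of tokens, which is why dashboards show only the extremes of each end.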
Activations Density 0.010%
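Activation density can be read as the percentage of sampled tokens on which the feature fires; 0.010% corresponds to roughly 1 token in 10,000. A minimal sketch, assuming "fires" means an activation strictly above zero (the threshold, the helper name `activation_density` and the synthetic activation array are assumptions for illustration):

```python
import numpy as np


def activation_density(acts, threshold=0.0):
    """Percentage of token positions where the feature activation exceeds threshold.

    Assumption: a feature counts as active on a token when its activation
    is strictly greater than `threshold` (0 by default).
    """
    acts = np.asarray(acts)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size


# Synthetic example: the feature fires on 1 of 10,000 token positions
acts = np.zeros(10_000)
acts[42] = 3.5
density = activation_density(acts)
print(f"{density:.3f}%")
```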