INDEX
Explanations
mentions of specific individuals and entities, particularly in entertainment and sports contexts
Head Attr Weights
0:0.02
1:0.02
2:0.08
3:0.16
4:0.39
5:0.03
6:0.05
7:0.04
8:0.05
9:0.04
10:0.05
11:0.03
Negative Logits
iliated       -1.87
berus         -1.65
contrasting   -1.65
ioch          -1.65
orgetown      -1.62
untarily      -1.62
emale         -1.60
ellery        -1.58
ecycle        -1.57
elled         -1.56
Positive Logits
folks         2.28
dudes         2.24
nerds         2.19
oooo          2.15
dear          2.06
Stupid        2.00
fuck          1.91
HAHAHAHA      1.90
sucks         1.87
dude          1.83
Activations Density 0.120%