INDEX
Explanations
prepositions that signify locations or affiliations
Head Attr Weights
0:0.01
1:0.01
2:0.04
3:0.04
4:0.08
5:0.02
6:0.03
7:0.43
8:0.03
9:0.02
10:0.15
11:0.10
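The weights above can be summarised with a short script. This is a minimal sketch that assumes each value is the share of the feature's attribution assigned to attention heads 0 to 11 of a single layer. The dashboard does not say how the values are computed. The script copies the listed numbers, renormalises them (they sum to 0.96, presumably because of rounding), and reports the dominant heads. Head 7 alone carries 0.43 of that 0.96 total, roughly 45%. The names `HEAD_ATTR_WEIGHTS`, `normalise`, and `top_heads` are hypothetical.

```python
# Head attribution weights copied from the listing above.
# Assumption: keys are head indices 0-11 within one attention layer.
HEAD_ATTR_WEIGHTS = {
    0: 0.01, 1: 0.01, 2: 0.04, 3: 0.04, 4: 0.08, 5: 0.02,
    6: 0.03, 7: 0.43, 8: 0.03, 9: 0.02, 10: 0.15, 11: 0.10,
}


def normalise(weights):
    """Rescale weights so they sum to 1 (the rounded values sum to 0.96)."""
    total = sum(weights.values())
    return {head: w / total for head, w in weights.items()}


def top_heads(weights, k=3):
    """Return the k heads with the largest weight, largest first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


if __name__ == "__main__":
    shares = normalise(HEAD_ATTR_WEIGHTS)
    for head in top_heads(HEAD_ATTR_WEIGHTS):
        print(f"head {head}: raw {HEAD_ATTR_WEIGHTS[head]:.2f}, "
              f"normalised {shares[head]:.3f}")
```

Three heads (7, 10 and 11) account for 0.68 of the listed weight. Under the assumption above, this suggests the feature is driven mainly by a small subset of heads.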
Negative Logits
alogue: -1.65
Offline: -1.55
メ: -1.54
runner: -1.42
earchers: -1.39
tracking: -1.39
selfie: -1.37
wik: -1.37
proxies: -1.36
GET: -1.34
Positive Logits
Engineering: 1.63
Oriental: 1.54
Civ: 1.53
Harvard: 1.51
undergrad: 1.47
Sciences: 1.46
NYU: 1.46
Advanced: 1.45
MIT: 1.45
college: 1.43
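Token lists like the two above are commonly produced by projecting the feature's decoder direction through the model's unembedding matrix. The most negative and most positive entries are then read off. The dashboard does not state its exact method, so the following is only a minimal sketch under that assumption, and it ignores any final layer norm. The function names, the toy decoder vector, the unembedding matrix, and the four-token vocabulary are hypothetical placeholders, not values from this feature.

```python
# Hypothetical sketch: direct logit effect of one feature = W_U^T @ d,
# where d is the feature's decoder direction (length d_model) and
# W_U is the unembedding matrix (d_model rows x vocab_size columns).


def logit_effects(decoder_dir, unembed):
    """Dot the decoder direction with each vocabulary column of W_U."""
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        for j in range(vocab_size)
    ]


def top_and_bottom(effects, vocab, k=2):
    """Return (most negative, most positive) token/value pairs, k of each."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    return ranked[:k], ranked[::-1][:k]


if __name__ == "__main__":
    # Toy placeholders (d_model = 2, vocab of 4 tokens), NOT real model weights.
    vocab = ["Harvard", "MIT", "Offline", "runner"]
    unembed = [
        [1.0, 0.8, -0.9, -0.5],
        [0.5, 0.6, -0.6, -0.9],
    ]
    decoder_dir = [1.0, 1.0]
    negative, positive = top_and_bottom(logit_effects(decoder_dir, unembed), vocab)
    print("Negative:", negative)
    print("Positive:", positive)
```

In a real model the same ranking would run over the full vocabulary, typically with a vectorised matrix product instead of the explicit loops used here for clarity.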
Activation Density: 0.016%
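Activation density is usually read as the share of tokens in a sample on which the feature fires, meaning its activation is above zero. Under that definition, 0.016% corresponds to roughly one token in 6,250 (1 / 0.00016). Below is a minimal sketch of the calculation. The function name `activation_density` and the zero threshold are assumptions; the dashboard may use a different threshold or sampling scheme.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`.

    Assumption: a feature counts as active when its activation is > 0;
    the dashboard may use a different threshold or sample.
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


if __name__ == "__main__":
    # Toy sample: 1 active position out of 6,250 tokens.
    sample = [0.0] * 6249 + [2.3]
    print(f"{activation_density(sample):.3f}%")
```

A density this low means the feature is silent on almost every token. This fits an explanation tied to a narrow context, such as prepositions that introduce institutions or affiliations.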