INDEX
Explanations
No Explanations Found
Head Attr Weights
0:0.09
1:0.08
2:0.08
3:0.07
4:0.08
5:0.08
6:0.07
7:0.08
8:0.07
9:0.08
10:0.08
11:0.08
Negative Logits
naire      -1.64
Duc        -1.56
irlf       -1.54
torch      -1.53
ooter      -1.50
bda        -1.48
mund       -1.48
sleeping   -1.48
ber        -1.46
hower      -1.46
Positive Logits
intervened    1.74
profits       1.71
cially        1.58
hops          1.58
profit        1.52
adata         1.49
productions   1.48
Inv           1.47
require       1.46
advis         1.45
Activations Density 0.000%
No Known Activations
This feature has no known activations.