INDEX
Explanations
phrases related to causality or reason
phrases that introduce explanations or justifications
Negative Logits
Contact                  -0.88
LAB                      -0.83
marine                   -0.77
zona                     -0.75
contact                  -0.71
Movie                    -0.70
Minimum                  -0.68
BuyableInstoreAndOnline  -0.67
Dro                      -0.66
Jr                       -0.64
Positive Logits
instance    1.24
gotten      1.22
bidden      1.19
example     1.15
centuries   1.09
millennia   0.97
cing        0.95
reasons     0.91
decades     0.91
cible       0.90
Activations Density: 0.096%