INDEX
Explanations
linguistic expressions related to rules, regulations, and standards
concepts related to compliance and social norms
Negative Logits
`.        -0.66
iven      -0.62
Written   -0.61
Fra       -0.58
];        -0.56
],        -0.56
åĪ        -0.55
ãĤ»       -0.55
ando      -0.55
Dim       -0.54
Positive Logits
deserve   1.14
are       1.05
tended    1.02
aren      1.00
tend      0.99
have      0.99
cannot    0.98
must      0.94
may       0.93
shouldn   0.93
Activations Density 0.470%
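
The density figure is usually read as the fraction of sampled token positions on which the feature fires at all. The page does not say how it was computed, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means (positions where the feature is active) / (all positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 1000 token positions, 5 of them with a nonzero activation.
toy = [0.0] * 995 + [0.3, 1.2, 0.7, 2.1, 0.05]
density = activation_density(toy)
print(f"{density:.3%}")
```

Under this reading, a density of 0.470% would mean the feature is active on roughly 47 of every 10,000 sampled tokens.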