Explanations
text containing the acronym "cs" with varying strengths of activation
instances of "cs" and "cb" related to computer science or coding terminology
Negative Logits
temptation     -0.72
voting         -0.68
conspicuous    -0.65
redistributed  -0.64
homophobic     -0.63
publicity      -0.62
JUSTICE        -0.61
discrim        -0.61
ZIP            -0.61
Mald           -0.61
Positive Logits
cs             1.27
olars          1.00
vier           0.91
ules           0.87
ancel          0.87
uits           0.86
ouls           0.85
aucas          0.84
ubs            0.84
kies           0.84
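The two logit lists rank the vocabulary tokens whose output logits this feature pushes down (negative) and up (positive). A common way dashboards compute such lists is to take the dot product of the feature's decoder vector with each token's unembedding row and keep the top and bottom k. The sketch below assumes that convention; the function names `logit_effects` and `top_k` and the two-dimensional vectors are hypothetical toy values, not the real model weights.

```python
# Hypothetical sketch of how positive/negative logit lists can be ranked.
# ASSUMPTION: a feature's effect on token t is dot(unembedding row of t, decoder vector).

def logit_effects(decoder_vec, unembed):
    """Return {token: effect}, the dot product of each token's unembedding
    row with the feature's decoder vector."""
    return {
        tok: sum(u * d for u, d in zip(row, decoder_vec))
        for tok, row in unembed.items()
    }

def top_k(effects, k, positive=True):
    """Return the k (token, effect) pairs with the largest effects,
    or the most negative ones when positive=False."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=positive)
    return ranked[:k]

# Toy 2-dimensional vectors (made up for illustration only).
decoder_vec = [1.0, 0.5]
unembed = {
    "cs": [1.0, 0.5],            # 1.0 + 0.25 = 1.25
    "olars": [0.8, 0.4],         # 0.8 + 0.20 = 1.00
    "voting": [-0.5, -0.4],      # -0.5 - 0.20 = -0.70
    "temptation": [-0.8, -0.2],  # -0.8 - 0.10 = -0.90
}

effects = logit_effects(decoder_vec, unembed)
print("positive:", top_k(effects, 2))
print("negative:", top_k(effects, 2, positive=False))
```

Sorting the full dictionary is fine for a toy vocabulary; for a real vocabulary of tens of thousands of tokens, a partial selection such as `heapq.nlargest` avoids a full sort.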
Activation density: 0.006%
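An activation density of 0.006% means the feature fires on roughly 6 out of every 100,000 tokens, so it is very sparse. The sketch below assumes density is defined as the fraction of token positions whose activation exceeds a threshold of zero, a common convention for sparse-autoencoder dashboards; the function names and the toy activation data are hypothetical.

```python
# Hypothetical sketch: activation density as the fraction of token positions
# where the feature's activation is strictly above a threshold (assumed 0.0).

def activation_density(activations, threshold=0.0):
    """Fraction of positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)

def as_percent(fraction):
    """Format a fraction as a percentage with three decimal places."""
    return f"{fraction * 100:.3f}%"

# Toy data: the feature fires at 6 of 100,000 token positions.
acts = [0.0] * 100_000
for i in (10, 500, 2_000, 30_000, 65_000, 99_999):
    acts[i] = 1.3

density = activation_density(acts)
print(as_percent(density))
```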