Chris Miller
@chrismiller.science
2.1K followers 210 following 300 posts
I study cancer at Washington University in St Louis. Cancer Genomics, Bioinformatics, Data Viz, Tumor Evolution, AML, Immunotherapy, Irreverent humor 🧬 🖥️ mostly @chrisamiller on other platforms
chrismiller.science
PSA: dbGaP authorized access/run selectors seem to be not working. I assume it's the shutdown - either they turned the lights off, or something's busted and no one is there to fix it
ALT: a man in a suit and tie holds his head in his hands
media.tenor.com
Reposted by Chris Miller
slavov-n.bsky.social
These 3-L bottles contain one million tiny colored spheres each.

One sphere is black (1 ppm).

Finding the black sphere is comparable to detecting a protein present at ~ 6,000 copies in the proteome of a human cell.

Quantifying the protein requires analyzing multiple jars.
chrismiller.science
I can't ever see the MTHFR gene symbol without doing a double take
Reposted by Chris Miller
jeremymberg.bsky.social
I have been trying to get this published as an op-ed, but I am going to post it here since I think it is timely in light of the "consent" extortion events.

Deafening Quiet from the Scientific Establishment

jeremymberg.github.io/jeremyberg.g...

1/14
jeremymberg.github.io
chrismiller.science
The lessons here:
1) Many gene names are stupid.
2) Edge cases may be rare, but they often matter. (TP53 is a key cancer gene that wouldn't be accessible without some special accommodations here).
3) As always, check your assumptions!
(fin)
chrismiller.science
For our little internal app, this probably won't matter much, and I will either set the number of records to 200 (because we generate almost no traffic) or might code up something that dynamically decides how many records to return, based on which genes are in the input data. (8/n)
chrismiller.science
For those who are interested, the plot showing cumulative percentage of human HUGO gene names (from ensembl protein-coding genes) covered by a set number of records looks like this. So 8 results covers 99% of genes, 34 results covers 99.9% of genes, and it takes 199 to cover everything. (7/n)
Plot showing how many records need to be returned to ensure that each completely typed gene will be in the list.
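A rough sketch of how the coverage numbers behind a plot like this could be computed. It assumes you already have, for each gene, the number of records that must be returned before that gene is guaranteed to appear. The function names and the toy counts are made up for illustration; they are not the real Ensembl-derived data.

```python
def coverage(counts, size):
    """Percent of genes whose required record count is <= size."""
    return 100 * sum(1 for n in counts if n <= size) / len(counts)


def min_size_for(counts, target_pct):
    """Smallest size whose coverage reaches target_pct, or None if counts is empty."""
    for size in sorted(set(counts)):
        if coverage(counts, size) >= target_pct:
            return size
    return None


# Toy per-gene counts (hypothetical), one entry per gene.
toy_counts = [1, 1, 1, 2, 3, 199]
print(min_size_for(toy_counts, 50))   # size needed to cover half the toy genes
print(min_size_for(toy_counts, 100))  # size needed to cover every toy gene
```

Run on the full per-gene list, `min_size_for(counts, 99)` and `min_size_for(counts, 100)` would give the kind of thresholds quoted in the post.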
chrismiller.science
So in order to guarantee that we'll get "AR" in the list, the value should be 200 records, which seems excessive. My instinctual guess of 30 wasn't bad, and covers 99.89% of gene names, but that's not all of them! (6/n)
ALT: a group of pokemon standing next to each other with gotta catch 'em all written on the bottom
media.tenor.com
chrismiller.science
It introduces a new question, though - this failed on TP53 with 10 results, so how many results need to be returned to handle all genes correctly? A few seconds of bash/grep later, I get the following list of 21 genes that will still fail. (5/n)
199  AR
120  PC
100  KL
78   ZNF7
67   ZNF2
67   CS
58   CP
58   ADA
57   SI
55   ZNF3
52   TH
51   C2
43   MAG
42   ZNF8
42   TNF
41   GPR1
37   DEFB1
36   USP1
36   GAL
34   PLEK
31   MET
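The bash/grep one-liner isn't shown, so here is one way a list like this might be produced in Python. It assumes the worst case, where the exact symbol is ranked after every other symbol that starts with the same characters, so the count for a gene is simply how many symbols share it as a prefix. That ranking assumption and the function names are mine, not something mygene.info documents.

```python
def records_needed(symbols):
    """Map each symbol to the number of symbols that start with it (itself included)."""
    ordered = sorted(set(symbols))
    needed = {}
    for i, sym in enumerate(ordered):
        # In sorted order, every symbol with prefix `sym` sits in one
        # contiguous run that begins at `sym` itself.
        j = i
        while j < len(ordered) and ordered[j].startswith(sym):
            j += 1
        needed[sym] = j - i
    return needed


def still_failing(symbols, size):
    """Symbols that could be pushed out of the first `size` records, worst first."""
    needed = records_needed(symbols)
    return sorted(
        ((n, s) for s, n in needed.items() if n > size),
        reverse=True,
    )


# Tiny made-up symbol list, just to show the shape of the output.
toy = ["AR", "ARID1A", "ARF1", "MET", "TP53", "TP53RK", "TP53TG3"]
for n, sym in still_failing(toy, size=2):
    print(n, sym)
```

With a real list of protein-coding gene symbols and `size=30`, the output would have the same two-column shape as the list above, though the exact numbers depend on the symbol set used.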
chrismiller.science
After some digging, it turns out that mygene.info has a default max of 10 records returned for each query, and the first 10 hits include genes like "TP53TGS", "TP53TG3F", "TP53RK-DT", but not "TP53" itself. Adding "&size=30" to the query allows it to return 30 hits, which solved this problem (4/n)
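A minimal sketch of the fix, for anyone who wants to try it. The exact query string is truncated in the post above, so the `q`, `species`, and `fields` values here are assumptions; only `size` comes from the thread. The helper names are hypothetical.

```python
from urllib.parse import urlencode

MYGENE_QUERY = "https://mygene.info/v3/query"


def build_query_url(typed_text, size=30):
    """Build a mygene.info query URL that asks for `size` hits instead of the default 10."""
    params = {
        "q": typed_text,       # assumption: the typed text is passed as the query
        "species": "human",    # assumption
        "fields": "symbol",    # assumption: only the symbol is needed for autocomplete
        "size": size,
    }
    return MYGENE_QUERY + "?" + urlencode(params)


def has_exact_symbol(response, symbol):
    """True if any hit in the parsed JSON response has exactly this symbol."""
    return any(hit.get("symbol") == symbol for hit in response.get("hits", []))


print(build_query_url("TP53"))
```

Fetching the URL (for example with `urllib.request.urlopen` plus `json.load`) and passing the parsed result to `has_exact_symbol(response, "TP53")` is a quick way to check whether a given `size` is large enough for a particular gene.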
chrismiller.science
But when I manually tried the query string - something like mygene.info/v3/query?spe... - TP53 didn't appear in the returned json - I know that's not right! (3/n)
chrismiller.science
Digging through the backend code, I found that the tool was storing the ENSG id as the key (sensibly!), and then using an API call to mygene.info to match them up to gene names as they are typed. Seems fine... (2/n)
chrismiller.science
Today's little mystery started simply enough: I was hacking on a web tool that autocompletes gene names, and was surprised when searching for "TP53" didn't return that gene. I checked, and it was definitely in the input data, so I was left scratching my head (1/n)
ALT: a close up of a man wearing glasses and a wig .
media.tenor.com
chrismiller.science
How do we get @ensembl.org the infrastructure they need to not be unresponsive like three times a week? Like can we pass the hat? I'm in for twenty bucks.
Reposted by Chris Miller
baym.lol
Immigrants, particularly on H1Bs, are the lifeblood of American innovation. If you wanted to hurt US competitiveness in the next century, I can think of few more effective ways than a move like this

Even when found illegal, the mere intent will have irreparably harmed our future
Reposted by Chris Miller
justinwolfers.bsky.social
Critical part of the President's new $100,000 charge for H1-B visas: The Administration can also offer a $100,000 discount to any person, company, or industry that it wants. Replacing rules with arbitrary discretion.

Want visas? You know who to call and who to flatter.
chrismiller.science
When you want to do reproducible analysis in R, some packages require you to set a RNG seed. I'm not sure I trust anyone who doesn't immediately run `set.seed(42)`
Reposted by Chris Miller
bedec.bsky.social
Zstandard's --long range mode works wonders for assemblies, but needs uninterrupted single line sequences.

*AllTheBacteria 661k, multiline fasta*
gzip (pigz): 751GB
zstandard --long: 641GB (30% original size)

*Single line fasta*
gzip (pigz): 700GB
zstandard --long: 232GB (10% original size)
chrismiller.science
For sure. Hybrid meetings are the worst of both worlds
Reposted by Chris Miller
internethippo.bsky.social
Everyone has a second full time job being mad at the government now
Reposted by Chris Miller
erictopol.bsky.social
Good news. The House of Representatives stands behind the NIH budget with no cuts.
Reposted by Chris Miller
murray.senate.gov
These are the words of a lunatic who does not belong in government, much less as our nation's top health official.

It is dangerous to allow him to oversee ALL federal health research and public health infrastructure. It is never too late to do the right thing. Fire RFK Jr.
Reposted by Chris Miller
benlangmead.bsky.social
📢🚨📢🚨 Genome Informatics deadline extended to September 8! meetings.cshl.edu/meetings.asp....
Please spread the word. If you are like me and had at least one abstract that wasn't quite ready by last week's deadline, you get another swing. See you there!
Genome Informatics
Cold Spring Harbor Laboratory Meetings & Courses -- a private, non-profit institution with research programs in cancer, neuroscience, plant biology, genomics, bioinformatics.
meetings.cshl.edu
Reposted by Chris Miller
benlangmead.bsky.social
The Genome Informatics conference (@ Cold Spring Harbor Lab, Nov 5 - 8) abstract deadline is **today**. We welcome your submissions! Topics include:
- PanGenomes
- Genome Assembly & Seq. Algos.
- Algorithmic Evo. Bio
- Single Cell & Spatial Omics
- Microbial Genomics
- AI/ML & Integrative Omics
🙏🙏🙏
Reposted by Chris Miller
ewcspottesmith.bsky.social
The Journal of Open Source Software (@joss-openjournals.bsky.social) is looking for editors. Come join our team!

I've only been an editor for a few months, but I love working with JOSS. Our peer review process is actually collaborative, and we're Diamond Open Access.

#AcademicSky 🧪 ⚗️ #CompChem
Call for editors | Journal of Open Source Software Blog
Blog for the Journal of Open Source Software • https://joss.theoj.org
blog.joss.theoj.org