The need for data, the need for good data

Another stream of consciousness. This one is a story that will make some people go “no shit, Sherlock,” but it’s a lesson I had to learn on my own, so here goes:

My work wants me to make plans for “professional development”: every year I should be gaining skills or insights that I didn’t have the year before.  Professional development is a whole topic on its own, but for now just know that I pledged to try to integrate machine learning into some of my workflows, for reasons.

Machine learning is what we used to call AI.  It’s not necessarily *generative* AI (like ChatGPT), I mean it can be, but it’s not necessarily so.

So for me, integrating machine learning wasn’t about asking ChatGPT to do all my work. Rather, it was about trying to write some code to take in Big Data and give me a testable hypothesis.  My data was the genetic sequences of many different viruses, and the hypotheses were “can we predict which animal viruses might spill over and become human viruses?” and “can we predict traits of understudied viruses using the traits of their more well-studied cousins?”

My problem was data.  

There is actually a LOT of genetic data out there on the internet.  You can search a number of repositories (NCBI is my favorite) and find a seemingly infinite number of genomes for different viruses.  Then you can download them, play around with them, and make machine learning algorithms with them.

But lots of data isn’t useful by itself.  Sure, I know the sequences of a billion viruses, but what does that get me?  It gets me the sequences of a billion viruses, nothing more, nothing less.

What I really need is real-world data *about* those sequences.  For instance: which of these viruses are purely human viruses, purely animal viruses, or infect both humans AND animals?  What cell types does this virus infect?  How high is the untreated mortality rate if you catch it?  How does it enter the cell?

This real-world data is what machine learning calls “labels,” and while I had a ton of data, I didn’t have much *labelled* data.  I can’t predict whether an animal virus might become a human virus if I don’t even know which viruses are human-only or animal-only.  I can’t predict traits of viruses if I don’t have any information about those traits.  I can do a lot of fancy math to categorize viruses based on their sequences, but without good labels for those viruses, my categories are meaningless.  I might as well be categorizing the viruses by their taste, for all the good it does me.

Data labels tell you everything that the data can’t, and without them the data can seem useless.  I can say two viruses are 99% identical, but what does that even mean?  Are they just two viruses that give you the sniffles and not much else?  Or does one cause hemorrhagic fever and the other encephalitis?

I don’t know if that 1% difference is even important.  If these viruses infect two different species of animals, it’s probably very important.  But if these viruses infect the same animals using identical pathways and are totally identical in every way except for a tiny stretch of DNA, then that 1% is probably unimportant.
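To make the point concrete, here is a minimal sketch of how a “99% identical” number can be computed. It assumes the two sequences are already aligned to the same length, and the sequences themselves are made up for illustration. Real genomes would need a proper alignment step first.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions that match between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# Made-up 100-nucleotide sequences that differ at exactly one position.
seq_a = "ATGC" * 25
seq_b = seq_a[:-1] + "A"

identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identical")
```

Notice what the function does not return: anything about hosts, symptoms, or mortality. The number is cheap to compute, and on its own it says nothing about whether the one differing position matters.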

Your model is only as good as your data, and your data is only as good as your labels.  The real work of machine learning isn’t finding data, it’s finding labelled data.  A lot of machine learning can be about finding tricks to get the data labelled. For instance, ChatGPT was trained on things like Wikipedia and Reddit posts because we can be mostly sure those are written by humans.  Similarly, if you find some database of viral genomes and a *different* database of other viral traits (what they infect, their pathway, their mortality rate), then you can get good data and maybe an entire publication just by matching the genomes to their labels.
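The matching step can be sketched in a few lines. This is only an illustration: the accession IDs, sequences, and the `host` field are hypothetical, and real databases rarely share clean, matching identifiers, which is where most of the actual work goes.

```python
# Hypothetical genome database: accession ID -> sequence.
genomes = {
    "VIR001": "ATGCGTACGTTAG",
    "VIR002": "ATGCGTACGTTAA",
    "VIR003": "ATGAAATTTGGGC",
}

# Hypothetical trait database keyed by the same (assumed) accession IDs.
traits = {
    "VIR001": {"host": "human"},
    "VIR003": {"host": "animal"},
    "VIR999": {"host": "both"},  # a label with no genome to attach it to
}

def join_labels(genomes, traits):
    """Split genomes into those with a matching trait record and those without."""
    labelled = {}
    unlabelled = []
    for accession, sequence in genomes.items():
        if accession in traits:
            labelled[accession] = {"sequence": sequence, **traits[accession]}
        else:
            unlabelled.append(accession)
    return labelled, unlabelled

labelled, unlabelled = join_labels(genomes, traits)
print(len(labelled), "labelled;", len(unlabelled), "unlabelled")
```

Only the genomes that land in `labelled` are usable for training a predictor. Everything in `unlabelled` is just more sequence.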

But the low-hanging fruit was picked a long time ago.  I’m trying to use public repositories, and if there was anything new to mine there then other data miners would have gotten to it first. I still want to somehow integrate machine learning just because I find coding so enjoyable, and it gives me something to do when I don’t want to put on gloves.  But clearly if I want to find anything useful, I have to either learn how to write code that will scrape other databases for their labels, create *my own data*, or maybe get interns to label the data for me as a summer project.  

Stay tuned to find out if I get any interns.

Gene drives and gingivitis bacteria

One piece of sci-fi technology that doesn’t get much talk these days is gene drives. When I was an up-and-coming biology student, these were the subject of every seminar, the case study of every class, and they were going to eliminate malaria worldwide.

Now though, you hardly hear a peep about them. And I don’t think, like some of my peers, that this is because anti-technology forces have cowed scientists and policy-makers into silence. I don’t see any evidence that gene drives are quietly succeeding in every test, or that they are being held back by Greenpeace or other anti-GMO groups.

I just think gene drives haven’t lived up to the hype.

Let me step back a bit: what *is* a gene drive? A gene drive is a way to manipulate the genes of an entire species. If you modify the genes of a single organism, then when it reproduces, at most 50% of its progeny will carry the modification you gave it. Unless your modification confers a lot of evolutionary fitness on the organism, there is no way to make every one of the organism’s descendants have your modification.

But a gene drive can do just that. In fact, a gene drive can confer an evolutionary disadvantage on an organism, and you can still guarantee that all of the organism’s descendants will have that gene. The biggest use-case for gene drives is mosquitoes. You can give mosquitoes a gene that prevents them from sucking human blood, but since this confers an evolutionary disadvantage, your gene won’t last many generations before evolution weeds it out.

But if you put your gene in a gene drive, you can in theory release a population of mosquitoes carrying this gene and ensure all of their descendants have the gene and thus won’t attack humans. In a few generations, a significant fraction of all mosquitoes will have this gene, thus preventing mosquito bites as well as a whole host of diseases mosquitoes bring.
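The difference between ordinary inheritance and a drive can be sketched with a toy model. Everything here is an assumption for illustration: random mating, an infinite population, a fitness cost paid per copy of the gene, and a `transmission` parameter for how often a carrier with one copy passes it on (0.5 is ordinary inheritance, values near 1 are a drive). The numbers are made up, not measured.

```python
def drive_frequency(q0, transmission, cost, generations):
    """Track the frequency of a costly gene across generations.

    q0           -- starting frequency of the gene in the population
    transmission -- chance a one-copy carrier passes the gene on
    cost         -- fitness penalty per copy of the gene (assumed)
    """
    q = q0
    history = [q]
    for _ in range(generations):
        two_copies = q * q * (1.0 - cost) ** 2
        one_copy = 2.0 * q * (1.0 - q) * (1.0 - cost)
        no_copies = (1.0 - q) ** 2
        mean_fitness = two_copies + one_copy + no_copies
        # Two-copy carriers always pass the gene on; one-copy carriers
        # pass it on with probability `transmission`.
        q = (two_copies + one_copy * transmission) / mean_fitness
        history.append(q)
    return history

mendelian = drive_frequency(q0=0.1, transmission=0.5, cost=0.1, generations=20)
driven = drive_frequency(q0=0.1, transmission=0.95, cost=0.1, generations=20)

print(f"ordinary inheritance after 20 generations: {mendelian[-1]:.3f}")
print(f"gene drive after 20 generations: {driven[-1]:.3f}")
```

With the same fitness cost, the ordinary gene shrinks every generation while the driven one spreads. That is the whole appeal, at least in a model with no mutations.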

Now this is a lot of genetic “playing God,” and I’m sure Greenpeace isn’t happy about it. But environmentalist backlash has never managed to stamp out 100% of genetic technology. CRISPR therapies and organisms are on the rise, and GMO crops are still planted worldwide. Environmentalists may hold back progress, but they cannot stop it.

But talk about gene drives *has* slowed considerably and I think it’s because they just don’t work as advertised.

See, to be effective a gene drive requires an evolutionary contradiction: it must reduce an organism’s fitness but still be passed on to the progeny. Mosquitoes don’t just bite humans for fun. We are some of the most common large mammals in the world, and our blood is rich in nutrients. For mosquitoes, biting us is a necessity for life. So if you create a gene drive that knocks out this necessity, you are making the mosquitoes who carry your gene drive less evolutionarily fit.

And gene drives are not perfect. The gene they carry can mutate, and even if redundancy is built in, that only means more mutations will be necessary to overcome the gene drive. You can make it more and more improbable that mutations will occur, but you cannot prevent them forever. So when you introduce a gene drive, hoping that all the progeny will carry this gene that prevents mosquitoes from biting humans, eventually one lucky mosquito will be born that is resistant to the gene drive’s effects. It will have an evolutionary advantage because it *will* bite humans, and so, like antibiotic-resistant bacteria, it will grow and multiply as the mosquitoes who still carry the gene drive are outcompeted and die off.
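Here is a toy sketch of that argument, not a model of any real drive. It assumes three versions of the gene: the drive (D), the wild type (W), and a rare resistant mutant (R) that the drive cannot convert. It also assumes random mating, an infinite population, a fitness cost per copy of D, and made-up parameter values.

```python
def next_generation(d, w, r, conversion, cost):
    """Advance drive (d), wild-type (w) and resistant (r) frequencies one generation."""
    freqs = {"D": d, "W": w, "R": r}
    gametes = {"D": 0.0, "W": 0.0, "R": 0.0}
    for a, fa in freqs.items():
        for b, fb in freqs.items():
            n_drive = (a == "D") + (b == "D")
            # Genotype frequency under random mating, weighted by fitness.
            weight = fa * fb * (1.0 - cost) ** n_drive
            if {a, b} == {"D", "W"}:
                # The drive converts the wild-type copy with some efficiency.
                gametes["D"] += weight * (1.0 + conversion) / 2.0
                gametes["W"] += weight * (1.0 - conversion) / 2.0
            else:
                # Ordinary inheritance: each copy is passed on half the time.
                gametes[a] += weight / 2.0
                gametes[b] += weight / 2.0
    total = sum(gametes.values())
    return tuple(gametes[k] / total for k in ("D", "W", "R"))

def simulate_resistance(d0=0.1, r0=1e-4, conversion=0.9, cost=0.1, generations=200):
    """Return a list of (drive, wild, resistant) frequencies, one per generation."""
    d, w, r = d0, 1.0 - d0 - r0, r0
    history = [(d, w, r)]
    for _ in range(generations):
        d, w, r = next_generation(d, w, r, conversion, cost)
        history.append((d, w, r))
    return history

history = simulate_resistance()
peak_drive = max(d for d, _, _ in history)
final_drive, _, final_resistant = history[-1]
print(f"peak drive frequency: {peak_drive:.3f}")
print(f"after 200 generations: drive {final_drive:.3f}, resistant {final_resistant:.3f}")
```

Under these assumptions the drive sweeps through the population first, and then the resistant version, which pays no fitness cost, slowly replaces it. How long that takes in practice depends entirely on numbers I have guessed at here.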

Antibiotics did not rid the world of bacteria, and gene drives cannot rid the world of mosquitoes. Evolution is not so easily overcome.

I tell this story in part to tell you another story. Social media was abuzz recently thanks to a guerrilla marketing campaign for a bacterium that is supposed to cure tooth decay. The science can be read about here, but I was first alerted to this campaign by stories of an influencer who would supposedly receive the bacteria herself and then pledged to pass them on to others by kissing them. Bacteria can indeed be passed by kissing, by the way.

But like gene drives, this bacterium doesn’t seem to be workable in the context of evolution. Tooth decay happens because certain bacteria colonize our mouths and produce acidic byproducts which break down our enamel. Like mosquitoes, they do not do this just for fun. The bacteria do this because it is the most efficient way to get rid of their waste.

The genetically modified bacteria were supposed to not produce any acidic byproducts, and so if you colonized someone’s mouth with these good bacteria instead of the bad bacteria, their enamel would never be broken down by the acid. But these good bacteria cannot just live in harmony and contentment. Life is a war for resources, and the good bacteria will be fighting with one hand tied behind their back.

Any time you come into contact with the bad bacteria, they will likely outcompete the good bacteria, because it’s more efficient to just dispose of your waste haphazardly than it is to wrap it in a nice, non-acidic bundle first. Very quickly the good bacteria will die off and once again be replaced by bad bacteria.

So I’m quite certain this little marketing campaign will quietly die once it’s shown the bacteria don’t really do anything. And since I’ve read that there aren’t even any peer-reviewed studies backing up this work, I’m even more certain of its swift demise.

Biology has brought us wonders, and we have indeed removed certain disease scourges from our world. Smallpox and rinderpest are gone, and hopefully polio will follow very soon, so it is possible to remove pests from our world. But it takes a lot more work than simply releasing some mosquitoes or kissing someone with the right bacteria. And that’s because evolution is working against you every step of the way.