Another stream of consciousness. This one is a story that will make some people go “no shit, Sherlock,” but it’s a lesson I had to learn on my own, so here goes:
My work wants me to make plans for “professional development”: every year I should be gaining skills or insights that I didn’t have the year before. Professional development is a whole topic on its own, but for now let’s just say that I pledged to try to integrate machine learning into some of my workflows, for reasons.
Machine learning is a big part of what gets called AI. It’s not necessarily *generative* AI (like ChatGPT). It can be, but it doesn’t have to be.
So for me, integrating machine learning wasn’t about asking ChatGPT to do all my work. Rather, it was about trying to write some code to take in Big Data and give me a testable hypothesis. My data was the genetic sequences of many different viruses, and the hypotheses were: “can we predict which animal viruses might spill over and become human viruses?” and “can we predict traits of understudied viruses using the traits of their more well-studied cousins?”
My problem was data.
There is actually a LOT of genetic data out there on the internet. You can search a number of repositories (NCBI is my favorite) and find a seemingly infinite number of genomes for different viruses. Then you can download them, play around with them, and make machine learning algorithms with them.
But lots of data isn’t useful by itself. Sure, I know the sequences of a billion viruses, but what does that get me? It gets me the sequences of a billion viruses, nothing more, nothing less.
What I really need is real-world data *about* those sequences. For instance: which of these viruses are purely human viruses, purely animal viruses, or infect both humans AND animals? What cell types does this virus infect? How high is the untreated mortality rate if you catch it? How does it enter the cell?
This real-world data is what machine learning calls “labels,” and while I had a ton of data, I didn’t have much *labelled* data. I can’t predict whether an animal virus might become a human virus if I don’t even know which viruses are human-only or animal-only. I can’t predict traits about viruses if I don’t have any information about those traits. I can do a lot of fancy math to categorize viruses based on their sequences, but without good labels for those viruses, my categories are meaningless. I might as well be categorizing the viruses by their taste, for all the good it does me.
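To make the “fancy math” a bit more concrete, here is a minimal sketch in Python of one common way to compare sequences without any labels: count the short overlapping chunks (k-mers) in each sequence and measure how different the counts are. The sequences, the function names, and the choice of k=2 are all made up for illustration, and this is not the method I actually used.

```python
from collections import Counter

def kmer_counts(seq, k=2):
    """Count every overlapping chunk of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(seq_a, seq_b, k=2):
    """Sum of absolute differences in k-mer counts (bigger = less similar)."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(abs(a[key] - b[key]) for key in set(a) | set(b))

# Toy sequences, invented for this example.
seq_1 = "AAAAAA"
seq_2 = "AAAAAT"
seq_3 = "GCGCGC"

close = kmer_distance(seq_1, seq_2)  # differ by a single base
far = kmer_distance(seq_1, seq_3)    # share no 2-mers at all
print(close, far)
```

The math happily says that the first two sequences belong together and the third is the odd one out. What it can’t say is whether that grouping corresponds to anything in the real world, and that’s the part the labels are for.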
Data labels tell you everything that the data can’t, and without them the data can seem useless. I can say two viruses are 99% identical, but what does that even mean? Is it just two viruses that give you the sniffles and not much else? Or does one cause hemorrhagic fever and the other cause encephalitis?
I don’t know if that 1% difference is even important. If these viruses infect two different species of animals, it’s probably very important. But if these viruses infect the same animals using identical pathways and are identical in every way except for a tiny stretch of genetic sequence, then that 1% is probably unimportant.
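For the curious, here is roughly where a number like “99% identical” comes from. This is a minimal sketch in Python that assumes the two sequences are already lined up and the same length (real genomes need an alignment tool first). The sequences and the `percent_identity` name are made up for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions that match between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)

# Two invented 100-base sequences that differ only at the last position.
virus_a = "ACGT" * 25
virus_b = "ACGT" * 24 + "ACGA"

identity = percent_identity(virus_a, virus_b)
print(f"{identity:.1f}% identical")  # prints "99.0% identical"
```

Notice that the function only counts matching letters. Nothing in it knows what the one mismatched base actually does, which is exactly the problem.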
Your model is only as good as your data, and your data is only as good as your labels. The real work of machine learning isn’t finding data, it’s finding labelled data. A lot of machine learning can be about finding tricks to get the data labelled. For instance, ChatGPT was trained on things like Wikipedia and Reddit posts because we can be mostly sure those are written by humans. Similarly, if you find some database of viral genomes and a *different* database of other viral traits (what they infect, their pathway, their mortality rate), then you can get good data and maybe an entire publication just by matching the genomes to their labels.
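The matching step itself can be simple once both databases share some kind of ID. Here is a minimal sketch in Python, assuming each genome and each label record carries the same accession ID. Every ID, sequence, and host label below is invented, and `attach_labels` is a hypothetical helper, not something from a real library.

```python
# Hypothetical genome database: accession ID -> sequence (all invented).
genomes = {
    "VIRUS_001": "ACGTACGTAC",
    "VIRUS_002": "ACGTACGTAA",
    "VIRUS_003": "TTGCACGTAC",
    "VIRUS_004": "GGGCACGTAC",
}

# Hypothetical trait database: accession ID -> host label (all invented).
host_labels = {
    "VIRUS_001": "human",
    "VIRUS_003": "animal",
    "VIRUS_999": "both",  # a label with no matching genome
}

def attach_labels(genomes, labels):
    """Pair each genome with its label; report genomes that have none."""
    labelled = {
        acc: (seq, labels[acc])
        for acc, seq in genomes.items()
        if acc in labels
    }
    unlabelled = sorted(acc for acc in genomes if acc not in labels)
    return labelled, unlabelled

labelled, unlabelled = attach_labels(genomes, host_labels)
print(f"{len(labelled)} of {len(genomes)} genomes have a label")
print("still unlabelled:", unlabelled)
```

In this toy case only half the genomes come out with a label. In practice the hard part is rarely the join itself; it’s finding a second database whose IDs actually line up with the first.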
But the low-hanging fruit was picked a long time ago. I’m trying to use public repositories, and if there were anything new to mine there, other data miners would have gotten to it first. I still want to somehow integrate machine learning just because I find coding so enjoyable, and it gives me something to do when I don’t want to put on gloves. But clearly, if I want to find anything useful, I have to either learn how to write code that will scrape other databases for their labels, create *my own data*, or maybe get interns to label the data for me as a summer project.
Stay tuned to find out if I get any interns.
