Massive amounts of data are generated every day on Earth and beyond - upwards of 2.5 quintillion bytes a day, as estimated by CloudTweaks. This offers exciting opportunities to work with data, in both academia and industry. Which setting is a better fit for you? It depends on how you want to work with data. Although data propels work forward in both academic and non-academic settings, academic and industry folks have different needs of data, and therefore different relationships to data.
When I made the transition from academia to industry 4 years ago, I was excited for the opportunities that lay ahead, but quickly learned I didn’t have the vocabulary to be successful. I applied to the Insight Data Science Fellowship, which offers training on how to apply quantitative skills in the industry job market. That is where I began to learn how differently people in industry and academia use, rely on, and speak about data. Over the years, these differences have become more pronounced to me, and I hope these descriptions will help clarify what “data” could mean for you in your career in industry and/or academia.
I see differences in how data is used in industry and academia across 3 main domains:
The language that people use in industry was a shock to me when I switched from academia to industry. People love using acronyms. Here’s just a small smattering of them:
I had never heard these acronyms during my time in graduate school, and quickly learned them all on the job. These words / letters can be intimidating if they are new to you, so it’s important to ask questions, pause, and make sure you understand what’s going on or else you could be working towards a goal that no one else is.
Across specialized fields of industry, one acronym or term could even refer to two completely different concepts. Consider the word “platform.” In tech, a platform is a code base that supports a wide range of services or products on top of it, unified by its horizontal supportive nature. In other industries, “platform” could mean that the company supports all steps of the process to get a product to a person‒the marketing, development, distribution, and feedback from the customer for one product is called a platform approach.
The reason I bring up language in an article about data is that data metrics are built on a shared understanding, or a shared lexicon, of what the numbers represent. If someone doesn’t know what a “net promoter score” (NPS) is as a concept, then the number doesn’t hold much value and cannot be used to rally a team towards improving it. Or worse, if someone misinterprets what a “net promoter score” is, the number could be used inappropriately.
Specificity of what a term, acronym, or number means is required in order to use data appropriately, and for a company to work towards the same quantifiable goals.
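To make the net promoter score example concrete, here is a minimal sketch in Python. It assumes the commonly used definition (percent of respondents rating 9 or 10, minus percent rating 0 through 6, on a 0-10 "how likely are you to recommend us?" question). The function name and the ratings are made up for illustration.

```python
def net_promoter_score(ratings):
    """Return NPS on a -100..100 scale from a list of 0-10 survey ratings.

    Assumes the common definition: promoters rate 9-10, detractors rate 0-6,
    and ratings of 7-8 (passives) count only toward the total.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical survey responses, not real data.
ratings = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]

nps = net_promoter_score(ratings)
average = sum(ratings) / len(ratings)

print(f"NPS: {nps:.0f}")                  # 5 promoters, 3 detractors, 10 responses
print(f"Average rating: {average:.1f}")   # a different number answering a different question
```

Someone who assumes NPS is just the average rating would read these same responses as 7.6 out of 10, while the NPS is 20 on a scale from -100 to 100. Same data, two different numbers, which is exactly why the shared definition matters.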
This is not so much a problem within the academic world. For better or worse, once you know the primary metrics a field uses to quantify success (or better yet, have coded them for a study), they don’t usually change, and there generally aren’t as many of them, so you can feel comfortable in your understanding of each metric. Personally, I did not get the sense that acronyms were as inhibitory in academia as I felt they were in industry. The closest thing to “alphabet soup” in academia is the set of grant numbers and funding agencies.
I have also noticed differences in how people speak about equations, models, functions, and machine learning across the two contexts. In academia, when you use a complex equation to characterize or predict behavior, it’s commonly called a function or an equation. In industry, these are called “models” or “algorithms”. Models and algorithms, just like equations, can be very simple (one to two terms) or very complex (lots of terms). The idea that a model gets better with more data or trials gives it a “smart” characteristic and industry tends to classify all “self-improving” models under an umbrella of “machine learning.” This can be confusing because if you are making the jump from academia to industry and you have built experiments where every trial is based on the participant’s responses over the last X trials, you have already built a machine learning model. Another concept that goes by different names is “prediction model” in industry vs. a “regression” in academia. Prediction models are usually just regressions, but “prediction model” helps convey its use in the business context - to forecast or predict what will happen based on a pattern in the existing data.
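As a rough illustration of the “prediction model is usually just a regression” point, the sketch below fits an ordinary least-squares line and then uses it to forecast the next value. The function names and the weekly sign-up numbers are hypothetical. In practice you would likely reach for a statistics library rather than write the fit by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Use the fitted line to forecast y for a new x."""
    return intercept + slope * x


# Hypothetical data: sign-ups observed in weeks 1 through 5.
weeks = [1, 2, 3, 4, 5]
signups = [12, 15, 21, 24, 28]

# The academic framing: "I ran a regression of sign-ups on week."
slope, intercept = fit_line(weeks, signups)

# The industry framing: "The prediction model forecasts next week's sign-ups."
forecast = predict(slope, intercept, 6)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, week 6 forecast={forecast:.1f}")
```

The math is identical in both framings. Only the name, and the emphasis on forecasting, changes.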
In industry, data is used to inform all stages of a company’s operations‒from tracking how many units of product are created and distributed, to how many people enter or leave a company within a certain timeframe. Data is used to measure the health of the company’s operations. Every step of the revenue-generating process can be tracked and measured, and the culture of the company will decide whether those aspects ARE tracked and measured. If a process metric is not meeting a certain threshold, employees can make changes to their actions in order to improve their process, which should then be reflected in their key performance indicator metrics. At a culturally-unified organization, people have alignment on which metrics they are all working towards and are encouraged to find efficiencies. You can only change what you track!
Given the small size of labs, tracking people leaving or entering the lab is not so necessary. Lab managers or principal investigators (PIs) may track how many participants participate in studies, how many studies are ongoing, or how much money is being spent on different resources. These metrics could be tracked in participant receipt forms, experimental protocols, or “run sheets.” However, unlike in companies, labs do not all gather around the same metrics, working towards the same quantifiable goal. Lab members are typically more focused on a different process: the methods of their experiments. In order to test hypotheses rigorously, researchers must develop methods that sufficiently localize the effect in question within their experimental design. And the design must remain as constant across participants as possible. Only across experimental conditions should the process change.
Another big difference between academia and industry is in how data is managed.
In many academic labs, individuals (graduate students, post-docs, Principal Investigators, etc.) are responsible for maintaining their own file organization and don’t usually collaborate on one file system. Only when necessary, for instance when sharing a resource like an MRI scanner or other technology, do people share a file system. I didn’t tend to see robust and generalizable data management systems that store data for all the people who come through a lab, possibly because people normally cycle in and out of labs on a 4-7 year timespan. So, individuals are responsible for deciding what data outputs look like, downloading the data, storing the data, manipulating the data, and hopefully sharing the data as part of a publication. For those not familiar with academic file management, just imagine hundreds of CSV files.
Contrastingly, in industry it is much more common to build a Data Warehouse or Data Lake depending on the needs of your company. A Data Warehouse or Data Lake can be used to aggregate and centralize all data sources into one place. A Data Warehouse will process and present the data in ways that support analytics or easy querying, whereas a Data Lake is a repository for your data. Data management is important for companies because multiple company functions may need to query the same data, so it’s important that everyone is working off of the same aggregated data. Without that consistency, different departments could build reports off of different data which can lead to confusion across functions. This is not an issue in academia.
It should probably come as no surprise that with very different goals, for-profit companies and academia focus on very different “outcomes.” Here, I refer to “outcomes” as a measurable result of some process or action.
The result of an experiment in academia can be considered an outcome. Did the experiment reject the null hypothesis or did it fail to reject the null hypothesis? Said another way: did the experimental condition result in a distribution of outcomes that was significantly non-overlapping with what we would expect in a world where the phenomenon did not occur? If this feels like a convoluted way to measure the result of an experiment, I can empathize with you. Experimental outcomes are always assessed in relation to something else, and this is why your experimental protocol and methods are so important. You need to be sure that the only “lever” you’re pulling in the experimental condition is the one you are testing. This is also why complex behaviors like the effect of music training on X are notoriously difficult to use in experiments. Even for testing the effect of simpler phenomena, there is already a necessary difference from how we experience the phenomena in the real world versus in the lab, because well, you’re in a lab being observed and recorded and manipulated, which is not the typical environment someone experiences outside of the lab. Metrics used to assess whether your experimental condition did the thing you set out to understand include confidence intervals, p-values, effect sizes, and several other metrics.
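If the null-hypothesis logic feels abstract, a permutation test makes it fairly literal: shuffle the condition labels many times to simulate “a world where the phenomenon did not occur,” then ask how often that world produces a difference as large as the one observed. The sketch below uses only the Python standard library. The reaction times are invented, the function names are my own, and a permutation test is just one of several ways to get a p-value.

```python
import random
import statistics


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(control, treatment, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign condition labels at random
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


def cohens_d(control, treatment):
    """Effect size: mean difference in units of the pooled standard deviation."""
    n1, n2 = len(control), len(treatment)
    v1, v2 = statistics.variance(control), statistics.variance(treatment)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd


# Hypothetical reaction times in milliseconds, not real data.
control = [512, 498, 530, 505, 521, 517, 509, 526]
treatment = [489, 476, 502, 481, 495, 470, 488, 479]

p_value = permutation_p_value(control, treatment)
effect = cohens_d(control, treatment)
print(f"p-value: {p_value:.4f}, Cohen's d: {effect:.2f}")
```

The p-value says how surprising the result would be under the null. The effect size says how big the difference is. They answer different questions, which is why papers usually report both.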
There are other data metrics that come into play in academia as well.
As an academic researcher or professor, the outcomes that matter for your career progression depend on your career stage, but are usually some mixture of: number of publications, impact factors of the journals where your publications are accepted, grants and dollars brought into the institution, types of grants, conference posters and presentations, percent of graduate students who obtain jobs, and securing a tenure-track position. As the culture of academia shifts, some of these outcomes will be measured differently. For instance, there’s debate over how useful an impact factor is as an indirect outcome or predictor of a research career or publication progress (see: 2, 3). In addition, there are huge racial disparities in NIH grant funding, which represent yet another actualization of the systemic racism we are all a part of (see: 4, 5). Therefore, being a great and qualified researcher will not always lead to desired outcomes, for reasons out of your control.
Outcomes in industry are quite different. If the company you work for is a revenue-seeking company, revenue will probably be one of the most important outcomes you work towards. Without revenue, your company’s dream will not be able to take hold. A popular phrase in industry is “revenue is king”; however, alternative interpretations also exist, such as: “revenue is vanity, profit is sanity, and cash is king.” Any of these metrics - revenue, profit, or cash - can be important outcomes for employees to keep their eyes on and support, and for leaders to rally their employees around. But there may also be secondary outcomes that give an indication of how your department is doing. For instance, if you know your product and engineering teams need to be iterating more and releasing more, then a KPI (key performance indicator) for the department could be “number of releases in a quarter”. However, it’s important that any department’s outcomes have a marked impact on overall corporate outcomes, like revenue.
Outcomes are an area where I think there is important overlap between academia and industry in their approach to collecting and analyzing data. In order for lab-based science to apply and “stick” in the real world, common benchmarks of success, like standardized assessments, need to be used in both the lab and real-world settings, and show the same success in measuring the phenomena. Labs in both academia and industry are creating technologies (often through mobile applications) that can capture and assess data from users no matter where they are located. For example, tools like Neurobehavioral Systems and KeyWise (built out of labs at University of California Davis and University of Illinois, respectively) offer bridging technologies for cognitive assessments. And patient reported outcomes (PROs) are health status checks that are received directly from the patient. These can be done on a mobile phone or over the internet so they are easier to administer in the lab and out in the real world.
If you’re in the market for a data-related position, there are some guiding patterns that may prove helpful to you. Although I wrote about differences in how people speak about data between academia and industry, the linguistic differences around how data is used and spoken about are quite learnable on the job. It may take some research before the interview to understand what’s important and how to talk about what’s important for the job, but I see the language as superficial to the concepts of what the data represent.
On the other hand, you may want to consider what type of data process you are most comfortable in. If you are excited about tracking every step of a person’s journey through the product or service and using that data to make changes on the fly, industry is probably a good bet. Contrastingly, if you want to home in on a very specific aspect of human behavior and help humanity understand it more thoroughly, academia is a great place to do that.
In terms of data management, from my experience, academia tends to be more scrappy in how data is captured, stored, and used. As a researcher in the lab, you may be responsible for developing the data capture mechanisms, designing the file system you use to store the data, writing, running, and interpreting the analyses, and preserving the data. In industry, data management systems can run the gamut. At a startup, your file system may look quite similar to that of an academic lab, but as the company grows and the data asset grows, data will need to be housed and maintained in a scalable way so that many people can use it for different purposes. In my opinion, the sooner a company implements data pipelines that can move data from storage into analyses, products, or services, and creates data documentation, the better time you’ll have accessing and using the data. To get a sense for where a company is in the evolution of its data infrastructure, some good interview questions to ask potential employers are: who is responsible for the health of the data infrastructure, how big is the team, and is there documentation for the data models?
In both academia and Machine Learning/Data Science endeavors in industry, a strong background in statistics is recommended. For analyst jobs in industry, you will likely want to feel comfortable using data to talk about business or product outcomes. Oftentimes analyst jobs do not require advanced statistics, but do require an appreciation of how to measure success within a business and the ability to talk about data results with non-technical audiences.
When thinking about which career track may be interesting to you, it can be helpful to think about how you want to spend your time. If you are someone who likes to convey a story or talk about what you can glean from a data set, but aren’t as keen on developing complex models or refining models, I would recommend seeking out data analyst jobs in industry. As a data analyst you may be mining data to understand the “what” of a situation and report the “what” to business stakeholders. For instance, you may use data to see at what step of the onboarding funnel participants drop off. When you bring the “what happened” to the conversation, other business stakeholders like marketing and product can add the “why”, and a collaborative story begins to unfold.
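For a sense of what the onboarding-funnel example can look like in practice, here is a small Python sketch. The step names, the event format, and the function names are all hypothetical. Real funnels usually come from a query against an event table, and this version simply counts users seen at each step without enforcing that they completed the steps in order.

```python
# Hypothetical onboarding steps, in the order users are expected to hit them.
FUNNEL_STEPS = ["visited_signup", "created_account", "verified_email", "completed_profile"]


def funnel_report(events):
    """Summarize (user_id, step) events as (step, users, conversion_from_previous)."""
    users_by_step = {step: set() for step in FUNNEL_STEPS}
    for user_id, step in events:
        if step in users_by_step:
            users_by_step[step].add(user_id)  # sets ignore duplicate events

    report = []
    previous = None
    for step in FUNNEL_STEPS:
        count = len(users_by_step[step])
        # The first step has no previous step, so it has no conversion rate.
        rate = None if not previous else count / previous
        report.append((step, count, rate))
        previous = count
    return report


def biggest_drop_off(report):
    """Return the step with the lowest conversion from the step before it."""
    candidates = [(rate, step) for step, _, rate in report if rate is not None]
    return min(candidates)[1]


# Made-up events for five users.
events = [
    ("u1", "visited_signup"), ("u1", "created_account"), ("u1", "verified_email"), ("u1", "completed_profile"),
    ("u2", "visited_signup"), ("u2", "created_account"), ("u2", "verified_email"), ("u2", "completed_profile"),
    ("u3", "visited_signup"), ("u3", "created_account"),
    ("u4", "visited_signup"), ("u4", "created_account"),
    ("u5", "visited_signup"),
]

report = funnel_report(events)
for step, count, rate in report:
    label = "n/a" if rate is None else f"{rate:.0%}"
    print(f"{step:<18} users={count}  conversion={label}")
print("Biggest drop-off at:", biggest_drop_off(report))
```

This is the “what”: in the made-up data, half of the users who create an account never verify their email. The “why” is the conversation you then have with marketing and product.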
If you are someone who likes building out the best model for a behavior and enjoys tinkering and tweaking in code, then I would recommend looking into Machine Learning or Data Science positions in industry. Many data scientist jobs include running A/B tests to see which version of a product or experience produces the desired effect, for example more clicks, more purchases, or more views. As a Machine Learning Engineer or Data Scientist, you could spend your time developing training data sets, tweaking model parameters incrementally to describe or predict behavior, or providing an experience that varies based on prior experiences. These roles likely operate on a more micro scale of human behavior, for example optimizing clickthrough rate (CTR) from one website page to another, whereas a data analyst will likely be working on broader problems like “why are people not completing their purchases?”
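As one hedged example of what analyzing an A/B test on clickthrough rate might involve, the sketch below runs a two-proportion z-test using only the Python standard library. The click and view counts are invented, the function name is my own, and real experimentation platforms often layer on more (sequential testing, multiple metrics, guardrails), so treat this as the textbook core rather than a full workflow.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare two clickthrough rates; returns (rate_a, rate_b, z, two_sided_p)."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Under the null hypothesis both versions share one underlying rate.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical counts: version A is the current page, version B is the redesign.
rate_a, rate_b, z, p_value = two_proportion_z_test(
    clicks_a=200, views_a=5000,
    clicks_b=260, views_b=5000,
)

print(f"CTR A: {rate_a:.1%}, CTR B: {rate_b:.1%}")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

If you have run a between-subjects experiment in a lab, this should look familiar: it is the same hypothesis-testing logic, with “condition” renamed to “variant” and the dependent measure being clicks.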
If you’re interested in managing data-related operations like how data sources interact, maintaining data quality, tracking data health, while not necessarily analyzing the data for product or service insights, I would recommend seeking out Data Manager, Data Coordination, or Data Product Manager roles.
Generalist Data Science positions may span data coordination and data analyst positions and are great for people who like working with data and are comfortable wearing different hats within the data system at a company.
And, if you have a strong curiosity for a topic or phenomenon in life, and are eager to test your assumptions in a rigorous way, I would recommend seeking out PhD positions that have a quantitative component to them. Academic research primarily focuses on the science, but data analysis is a tool for measuring scientific findings. All of the aforementioned positions will use data, but your relationship to it and day-to-day interactions with it will be very different.
Lastly, it is important to remember that data skills are transferable across companies, domains, and jobs. So if you start off in one data-related career but realize it’s not for you, there are so many companies and institutions with unique data needs that, with a bit of research and patience, you may be able to find a version of the data role that better suits your skills and interests. By “transferable”, I also mean that the data skills gained during a PhD program in academia can be packaged as industry-relevant skills and vice versa (with the caveat that I recommend having a passion for a topic or phenomenon before moving into academia). Data is a common language across academia and industry, and so you can market it as such. Oftentimes graduate programs don’t educate graduate students on how the skills gained in a PhD transfer to non-academic settings, but it is an important lever that graduate students have in crafting their career trajectory.
Although I only discussed data-related differences and similarities between academia and industry, there are a lot of other considerations when deciding between the two career tracks, for instance geographic flexibility, salary, and promotion schedules. It is also important to know that you can span both areas AND zig-zag back and forth, if you so desire. I have heard from graduate students still in school, and also from PhDs who have gone into industry, that once you leave academia you cannot go back. While your engagement with academia may look different than a “traditional” academic track, there are jobs and other opportunities to stay involved in academia even after “leaving” academia for an industry job. Some examples include analyzing data from your time as a graduate student that you never published, attending conferences and webinars to stay current in your field, seeking out opportunities that have ties to an academic lab, and engaging in networks of other PhDs in industry. This is important because you can use your skills of working with data to move back and forth between industry and academia.
I hope this description is helpful as you think about what a career in industry or academia may look like, and how you might interact with data in either (or both!) career tracks. Regardless of which path you take, academic or industry, having experience working with data, analyzing data, or making decisions with data will create flexibility and options for your career trajectory.