Data Scientists Dial Back Use of Open Source Code Due to Security Worries

Vulnerabilities in open source components — such as the widespread bugs revealed 10 months ago in Log4j 2.0 — have forced data scientists to reevaluate the open source code often used in analytics and the creation of machine learning models.

According to a report by Anaconda, a data science platform company, 40% of surveyed data scientists, business analysts and students have reduced their use of open source components in the past year, while a third held their usage steady and only 7% incorporated more open source code into their projects. Most respondents do not report to the information technology department (only 18% do); the largest group (47%) works within its own data science or research and development group, according to Anaconda’s “2022 State of Data Science” report, released last week.

While software developers and IT have already started looking into secure code, concerns about the security of open source software are a relatively new trend for the data science world, said Peter Wang, co-founder and CEO of Anaconda.

“We see a huge proportion of people who are in organizations where IT has created a very tight stance around open source and Python,” he says. “These aren’t expert developers… They’re data scientists and machine learning people who may not be very experienced developers at all, using whatever they could download to do their analysis and then handing it off to IT.”

The security of open source components – and the software supply chain in general – has become a primary consideration among software developers, companies and national governments over the past two years. In May, for example, the US National Institute of Standards and Technology (NIST) issued guidelines for managing risks in the software supply chain. Additionally, a growing number of software vendors have joined the Linux Foundation’s Open Source Security Foundation (OpenSSF).

Vulnerability scanning and use of proprietary software most common.
While many data science teams scan open source components for vulnerabilities, many instead create their own software. Source: Anaconda’s “2022 State of Data Science” report.

Overall, the maturity of organizations’ security efforts has improved. About half of companies have an open source security policy in place, which leads to better security preparedness, according to a June survey. In addition, efforts to control open source risk have increased by 51% in the past 12 months, according to a security maturity study published on September 21.

“[W]ith the focus on software supply chains, most enterprise organizations are taking a risk-based approach to application security,” said Jason Schmitt, general manager of Synopsys’ Software Integrity Group, in a statement announcing the study. “Such an approach recognizes that security is not limited to the code base; it includes the process of software development where security reviews and testing ‘shift everywhere’ to continuously improve security outcomes.”

Developers are expanding the use of open source

Software companies aren’t seeing any kind of decline in open source use, according to other data. Instead, development organizations focus on improving the security of open source software and use security as a primary guide in component selection.

For example, in the “2021 State of the Software Supply Chain” report, Sonatype found that the top four open source ecosystems, Maven Central Repository (Java), npm (JavaScript), Python Package Index (Python), and the NuGet Gallery (.NET), housed 37 million open source projects and components, a 20% increase over the previous year. Demand for these components is also increasing: more than 2.2 trillion components were downloaded, an annual increase of 73%.
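As a back-of-the-envelope check on those growth figures, the short Python sketch below derives the prior-year totals they imply. The function name `previous_value` is purely illustrative, and the results are rough estimates computed from the rounded numbers quoted above, not figures reported by Sonatype.

```python
def previous_value(current, growth_rate):
    """Back out last year's value from this year's value and the annual growth rate."""
    return current / (1 + growth_rate)


# Inputs are the rounded figures quoted in the article.
projects_prev = previous_value(37e6, 0.20)      # 37 million projects, up 20%
downloads_prev = previous_value(2.2e12, 0.73)   # 2.2 trillion downloads, up 73%

print(f"Implied prior-year projects: about {projects_prev / 1e6:.1f} million")
print(f"Implied prior-year downloads: about {downloads_prev / 1e12:.2f} trillion")
```

In other words, the supply of components grew by roughly 6 million in a year, while downloads grew by close to a trillion.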

The self-reported move away from open source packages by the data science community is more likely a sign of greater awareness of security concerns than a sign that teams are actually discarding open source components in development, says Tracy Miranda, head of open source at Chainguard.

While data science teams and development teams may have reacted differently to major security issues such as Log4j 2.0, companies have few options when moving away from one open source package other than to adopt another package whose maintainers have placed a greater emphasis on security, she says.

“Companies are leveraging open source as a way to increase their speed, so if they scale back, what are they scaling back to? Are you writing code in-house? Are you using third-party versions bundled together?” Miranda says, adding that instead, “I think we can expect to see companies be more discerning about the quality of the open source they use, especially related to security features.”

Data scientists are playing catch-up

The disconnect between the two sets of findings is probably due to the different populations surveyed. Anaconda’s survey focused on data science professionals, as evidenced by its respondents’ choice of programming language: 58% used Python and 42% used SQL, while only 26% used JavaScript.

A better measure of software developer sentiment is Stack Overflow’s “2022 Developer Survey,” which found that while 58% of people learning to code use Python, only 44% of professional developers code in that language. On the other hand, 68% of professional developers use JavaScript, according to the Stack Overflow survey.

Additionally, while data science professionals work at companies that overwhelmingly (87%) allow open source software, about a quarter (26%) have minimal IT oversight of their open source choices, the Anaconda report said. In another 18% of companies, the IT department specifies only about half of the available open source components.

Maintainers of the most critical projects—of which there are hundreds, if not thousands—must use safe dependencies, test their own code, and validate the credibility of contributors. Maintainers should also publish a security scorecard — a Google-created initiative now managed by the Open Source Security Foundation (OpenSSF) — that assigns a security grade to a project based on nearly 20 different criteria.

While awareness is likely increasing, there is no quick fix, says Miranda.

“The reality is that the safer options have not previously existed,” she says. “Trimming unnecessary dependencies to reduce the attack surface makes sense, but it’s hard to do once the dependency tree gets big.”
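To illustrate why trimming gets harder as the tree grows, here is a minimal Python sketch that counts the transitive dependencies behind a handful of direct ones. The package names, the `EXAMPLE_GRAPH` mapping and the `transitive_dependencies` helper are all hypothetical, made up for illustration; a real project would read this information from its package manager.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package reachable from `root`, excluding `root` itself."""
    seen = set()
    queue = deque(graph.get(root, ()))
    while queue:
        pkg = queue.popleft()
        if pkg in seen or pkg == root:
            continue  # already counted, or a cycle back to the root
        seen.add(pkg)
        queue.extend(graph.get(pkg, ()))
    return seen


# Hypothetical dependency graph: package name -> direct dependencies.
EXAMPLE_GRAPH = {
    "analysis-app": ["dataframe-lib", "plotting-lib"],
    "dataframe-lib": ["array-lib", "date-lib"],
    "plotting-lib": ["array-lib", "image-lib"],
    "array-lib": [],
    "date-lib": [],
    "image-lib": ["compression-lib"],
    "compression-lib": [],
}

direct = set(EXAMPLE_GRAPH["analysis-app"])
everything = transitive_dependencies(EXAMPLE_GRAPH, "analysis-app")
print(f"{len(direct)} direct dependencies, {len(everything)} in total")
```

Even in this toy graph, two direct dependencies pull in six packages overall, and each of those adds to the attack surface.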

