19-PM2-08: ST1.3 - Data Science for Innovation Challenges
The information field has changed dramatically over the past years, affecting the economy, technology, culture and society. However, these changes have left an even stronger mark on business systems (Jin 2015). Considering the mass of digital information produced in the past 10 years, companies have found themselves in a chaotic and constantly expanding digital universe. To innovate and stay competitive, companies must master methods and tools to prevent information overload, while gaining useful knowledge from the available data (Feng 2015).
The discipline of Data Science has emerged as a clear (although broad) field of research to solve data-related problems (Provost 2013). Data science is an interdisciplinary field that uses scientific methods to extract knowledge and insights from structured and unstructured data. It attracts researchers and encompasses methodologies from wide-ranging fields such as statistics, mathematics, information science, computer science, data analysis, machine learning and communication, and is therefore an ideal tool to bridge the gap between research, industry and society (Waller 2013).
The objective of the present track is to collect works that use state-of-the-art Data Science tools and techniques to gather, transform, model and visualize data (Wickham 2014) to gain valuable information relevant for firm innovation. The scope is to use publicly available data to obtain a clearer view of which information sources contain the most untapped value and which methods and tools can be used to uncover it.
The main contributions are expected to highlight which information is relevant for different companies to build knowledge as a tool for innovation, in particular related to:
- data science for product innovation (Chiarello 2018a; Tan 2015): e.g. data-driven product development, A/B testing, patents analysis, product success evaluation, machine learning for innovation.
- data science for technology intelligence (Colladon 2018; Chiarello 2018b): e.g. brand analysis, competitors mapping, partners individuation, tools for knowledge visualization and communication.
- data science for open innovation & co-creation (Hoornaert 2017): e.g. papers mapping, open analytics, IP analysis, cloud computing.
- data science for new skill identification & mapping (Frey 2017): e.g. curricula analysis, job vacancies identification, job creation, new skills for innovation.
We expect to see contributions coming from the usual sources (e.g. open databases, patents, papers, social media) but we especially welcome contributions from less-known sources. Since Data Science is broad, we expect to showcase a wide range of methodologies such as machine learning, deep learning, natural language processing, image analysis or tools for data visualization and communication, to name a few.
Patent infringement analysis using Text mining technique based on SAO structure
Dongguk University, Korea, Republic of (South Korea)
Recently, non-practicing entities (NPEs) have become a hot topic, and the cost of patent infringement damages is significant. The importance of infringement analysis has therefore increased. Identifying infringement involves various factors, and the most appropriate method is expert analysis; however, automated, quantified analysis is needed as the volume of data continues to grow.
The patent infringement analysis using automated and quantified methods can be divided into using structured data and using unstructured data.
In the case of structured data, network analysis is performed using citation data or patent document kind codes. This approach cannot cover all the criteria for patent infringement and cannot be used with a database that lacks citations. It also takes time to build a citation network, and because contemporary patents are omitted, such analyses tend to underestimate the importance of new patents.
In the case of unstructured data, text data are usually used. Text data are converted into a structured format using text mining techniques and analyzed using various algorithms. Huang, S.H. performed an analysis using the SOFM algorithm, and Lee, S. used principal component analysis (PCA). In addition, patent infringement analysis using the SAO structure has been studied extensively; Park, H. and Park, I. are typical examples.
All of this literature focused on identifying possibly infringing patents with high technical similarity, defining the possibility of infringement as technical similarity between sentences, SAO structures, documents, etc. In other words, the scope of the patent right was not considered, almost only patent documents were used, and the types of infringement were not considered.
1.1 How can patent-infringement cases be identified systematically, considering various factors and data sets?
1.2 How can various factors (such as technical similarity and the scope of patent rights) be reflected?
1.3 How can a variety of data sets (such as patents and product specifications) be used?
2.1 How can direct infringement and indirect infringement be judged and scored?
In the first module, preprocessing and SAO2Vec are performed: the target is selected, patent documents and product documents are collected, and SAO2Vec is applied. SAO2Vec is an algorithm defined using the Stanford parser, the SAO structure and the doc2vec methodology. In the second module, the SAO vectors from module 1 are used together with document similarity to identify candidate cases of potential patent infringement. Lastly, after the rights components of the candidate cases are extracted, a direct infringement score is calculated through identity and ease-of-substitution analysis, and an indirect infringement score is calculated through importance analysis.
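The candidate-identification step of the second module can be pictured with a minimal sketch. It assumes that each patent and each product has already been reduced to a single vector and that candidates are the pairs whose cosine similarity reaches a cut-off. The function names, the toy vectors and the 0.8 cut-off are hypothetical and are not taken from the study; the SAO2Vec step itself is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def candidate_pairs(patent_vecs, product_vecs, min_similarity=0.8):
    """Return (patent, product, similarity) triples at or above the cut-off,
    most similar first. The cut-off value is an assumption for illustration."""
    pairs = []
    for patent_id, p_vec in patent_vecs.items():
        for product_name, q_vec in product_vecs.items():
            sim = cosine(p_vec, q_vec)
            if sim >= min_similarity:
                pairs.append((patent_id, product_name, sim))
    return sorted(pairs, key=lambda t: t[2], reverse=True)

# Toy, hand-made vectors (hypothetical; real vectors would come from SAO2Vec).
patents = {"patent_a": [1.0, 0.0, 1.0], "patent_b": [0.0, 1.0, 0.0]}
products = {"product_x": [1.0, 0.1, 0.9], "product_y": [0.0, 1.0, 0.2]}

candidates = candidate_pairs(patents, products)
for patent_id, product_name, sim in candidates:
    print(patent_id, product_name, round(sim, 3))
```

In this toy setting only the pairs (patent_a, product_x) and (patent_b, product_y) survive the cut-off; such pairs would then be passed on to the scope-of-rights analysis.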
The proposed framework is illustrated using data from the patents and products of real-world companies. In this study, we analyze the gaming device (console) industry in the United States. In the United States, a patent infringement suit requires that the product alleged to infringe the patent be submitted. Thus, lawsuits are common in industries where sample products are easy to obtain, and the game device industry is one of them.
Thus, the analysis data set was defined as the patents of game device companies in the U.S. together with the product descriptions, specifications and user review data for each company's products. In total, 189,059 patents were collected from 15 companies, and product documents were collected for 38 products listed on each company's official website. The product data were obtained from the official website of each company and from Amazon product information and user reviews.
In addition, actual patent infringement litigation data were used as a verification set to validate the analysis results. The litigation data were collected from CourtListener (https://www.courtlistener.com), which is sponsored by the non-profit Free Law Project. Using the data collected, the validity of the analysis results will be demonstrated.
In the first module, the SAO structure vectors and document vectors were derived. In the second module, candidates for potential patent infringement were derived. For example, the pair (patent US7145461, product Nintendo Wii) was derived, which means that the Nintendo Wii may have infringed patent US7145461. A total of 43 candidate pairs were obtained, and these are the input data for the scope-of-rights analysis in Module 3. The final result of Module 3 is the direct and indirect infringement scores. Both scores take a value between 0 and 1, and the larger the score, the more likely the case is to be an infringement.
The average direct infringement score, 0.49, and the average indirect infringement score, 0.37, were set as the thresholds. For example, suppose the case (patent a, product b) has a direct score of 0.6 and an indirect score of 0.34. This case is interpreted as direct infringement, because the direct score exceeds its threshold and the indirect score does not. Direct and indirect scores were calculated for all cases, and 60% accuracy was obtained when compared with the validation data.
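The threshold rule can be written down as a short sketch. The two thresholds and the worked case (0.6, 0.34) come from the text above; the function name and the labels returned for the other three combinations are assumptions added for completeness, since the abstract only describes the direct-only case.

```python
DIRECT_THRESHOLD = 0.49    # average direct infringement score reported above
INDIRECT_THRESHOLD = 0.37  # average indirect infringement score reported above

def classify_case(direct_score, indirect_score,
                  direct_threshold=DIRECT_THRESHOLD,
                  indirect_threshold=INDIRECT_THRESHOLD):
    """Label a (patent, product) case by comparing each score with its threshold.

    Only the "direct" outcome is described in the abstract; the other labels
    are hypothetical and shown for illustration.
    """
    is_direct = direct_score > direct_threshold
    is_indirect = indirect_score > indirect_threshold
    if is_direct and is_indirect:
        return "direct and indirect"
    if is_direct:
        return "direct"
    if is_indirect:
        return "indirect"
    return "none"

# Worked case from the text: direct 0.6, indirect 0.34.
print(classify_case(0.6, 0.34))  # prints "direct"
```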
Contribution to Scholarship
Existing patent-infringement analyses could present possible cases using only technical similarity, did not consider the type of infringement, depended on expert knowledge, and in many cases used only patent documents. This study uses not only technical similarity between documents but also the components of claims and the scope of patent rights. It also uses various kinds of data sets, such as patent documents and product specifications.
In other words, this study makes possible an objective and systematic analysis that considers various factors and data sets without expert knowledge. Another contribution is that the results show the type of infringement: the direct infringement score considers the AER (all elements rule) and the DOE (doctrine of equivalents), and the indirect infringement score considers the priority of components. A further contribution is that the approach enables analysis using large data sets.
Contribution to Practice
As mentioned earlier, patent infringement analysis is a very important issue for businesses and may be directly linked to their survival. The studies presented so far have only been able to indicate the possibility of patent infringement and have relied on expert knowledge. In practice this is a difficult task because it requires a lot of time and resources. The contribution of the proposed framework is to address these limitations by conducting quantified patent infringement analysis automatically using large data sets.
This paper applies NLP and text mining techniques to patent, product and user review data obtained from the Internet to perform patent infringement analysis. It is an attempt to address one of the firm innovation challenges by extracting information that can provide insight into intellectual property and patent management.
 Soo, V. W., Lin, S. Y., Yang, S. Y., Lin, S. N., & Cheng, S. L. (2006). A cooperative multi-agent platform for invention based on patent document analysis and ontology. Expert Systems with Applications, 31(4), 766–775.
 Lai, Y. H., & Che, H. C. (2009). Modeling patent legal value by Extension Neural Network. Expert Systems with Applications, 36(7), 10520–10528.
 Crampes, C., & Langinier, C. (2002). Litigation and settlement in patent infringement cases. The RAND Journal of Economics, 33(2), 258–274.
 Park, H., Yoon, J., & Kim, K. (2011). Identifying patent infringement using SAO based semantic technological similarities. Scientometrics, 90(2), 515-529.
 Durham, A. L. (2004). Patent law essentials: A concise guide. Westport, CT: Praeger Publishers.
 Wallerstein, M. B., Mogee, M. E., & Schoen, R. A. (1993). Global dimensions of intellectual property rights in science and technology. Washington, DC: National Academies Press.
 Majewski, S. E., & Williamson, D. V. (2004). Incomplete contracting and the structure of R&D joint venture contracts. In Intellectual property and entrepreneurship (pp. 201-228). Emerald Group Publishing Limited.
 Arundel, A. (2001). The relative effectiveness of patents and secrecy for appropriation. Research Policy, 30(4), 611–624.
 Lee, C., Song, B., & Park, Y. (2013). How to assess patent infringement risks: a semantic patent claim analysis using dependency relationships. Technology Analysis & Strategic Management, 25(1), 23-38.
 Sternitzke, C., Bartkowski, A., & Schramm, R. (2008). Visualizing patent statistics by means of social network analysis tools. World Patent Information, 30(2), 115-131.
 Kasravi, K., and M. Risov. 2009. Multivariate patent similarity detection. Paper presented at the 42nd Hawaii International Conference on System Sciences, January 5–8, in Hawaii, USA.
 Huang, S.H., H.R. Ke, and W.P. Yang. 2008. Structure clustering for Chinese patent documents. Expert Systems with Applications 34: 2290–97.
 Lee, S., B. Yoon, and Y. Park. 2009. An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation 29: 481–97.
 Park, I., & Yoon, B. (2014). A semantic analysis approach for identifying patent infringement based on a product–patent map. Technology Analysis & Strategic Management, 26(8)
 Klein, D., & Manning, C. D. (2003, July). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1 (pp. 423-430). Association for Computational Linguistics.
 T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems, 2013, pp. 3111-3119.
The Data Science of Startup Competition
1Agoranov, Paris, France; 2Sorbonne Université, Paris, France; 3i3-CNRS, École Polytechnique, France
Using Natural Language Processing techniques, we have developed a way to compute the similarity between startup descriptions, effectively measuring competition. From this measure, we can visualize maps of the competitors of a given startup and use competition features in prediction models.
Competition has been a subject of study for decades. In the specific case of startups however, very little has been done.
To our knowledge, it has only been studied in the context of M&A or through crowdsourcing, which limits the range of potential applications.
This is especially surprising, as a competition measure can be particularly useful either as a simple similarity service or as a way to improve existing startup success predictions.
In a previous paper, we designed a way to visualize startup ecosystems through treemaps, which is still applicable to competitors.
With these technological bricks, coupled with the now large availability of startup databases, we propose a tool to automatically visualize a startup's competitors.
As far as we know, no tool is available to show competition automatically. Some approaches have been proposed in the context of M&A, but not for companies in general. Moreover, competition has been largely overlooked in startup success prediction models, although it seems like a promising feature.
How can the competition of a given entity be computed effectively and automatically, and how can it be visualized so that all relevant information is seamlessly available?
Can it be used in a predictive model?
We used Natural Language Processing techniques, namely word2vec and Smooth Inverse Frequency, to compute vector representations of each startup description in the Crunchbase database.
From these, we can measure the similarity between companies and derive each company's competitors.
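As an illustration of this step, the sketch below builds a description vector as a Smooth Inverse Frequency weighted average of word vectors, with each word weighted by a / (a + p(w)), and compares descriptions by cosine similarity. It is a simplified sketch under stated assumptions: the word vectors and word probabilities are tiny hand-made values rather than trained word2vec output, the value a = 1e-3 is a commonly used default rather than a setting reported here, and the common-component removal step of the full Smooth Inverse Frequency method is omitted.

```python
import math

def sif_embedding(tokens, word_vecs, word_prob, a=1e-3):
    """Weighted average of word vectors with SIF weights a / (a + p(w)).

    Tokens without a vector are skipped. The principal-component removal of
    the full SIF method is intentionally left out of this sketch.
    """
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    used = 0
    for tok in tokens:
        if tok not in word_vecs:
            continue
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, x in enumerate(word_vecs[tok]):
            total[i] += weight * x
        used += 1
    return [x / used for x in total] if used else total

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0

# Toy 2-dimensional "word vectors" and unigram probabilities (hypothetical).
word_vecs = {"payment": [1.0, 0.0], "mobile": [0.8, 0.2],
             "farm": [0.0, 1.0], "the": [0.5, 0.5]}
word_prob = {"payment": 0.001, "mobile": 0.001, "farm": 0.001, "the": 0.05}

startup_a = sif_embedding("the mobile payment".split(), word_vecs, word_prob)
startup_b = sif_embedding("payment the".split(), word_vecs, word_prob)
startup_c = sif_embedding("the farm".split(), word_vecs, word_prob)

sim_ab = cosine(startup_a, startup_b)  # two payment-related descriptions
sim_ac = cosine(startup_a, startup_c)  # payment versus farming
print(sim_ab > sim_ac)
```

The frequent word "the" receives a weight close to zero, so the two payment-related descriptions end up far more similar to each other than to the farming one, which is the behaviour a competitor ranking relies on.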
We can then visualize this using Treemaps, which allow us to order the information and make it more accessible.
A prediction model was also implemented using Random Forest algorithms, and the importance of competition-related features was measured by removing them from the model and measuring the decrease in accuracy.
We used as our base data all the information available on Crunchbase, a mainstream source of data for academic research that specializes in startup and funding round data.
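The removal-based importance measure can be sketched independently of the model. To keep the example self-contained, a nearest-centroid classifier stands in for the Random Forest (this substitution is an assumption of the sketch, not the authors' model), and the feature names and data rows are hypothetical.

```python
def nearest_centroid_accuracy(train, test, features):
    """Stand-in model (not a Random Forest): predict the class whose
    centroid, computed on the selected features, is closest."""
    centroids = {}
    for label in sorted({y for _, y in train}):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for x, y in test:
        point = [x[f] for f in features]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == y
    return correct / len(test)

def removal_importance(train, test, features, removed,
                       score=nearest_centroid_accuracy):
    """Accuracy with all features minus accuracy without the removed ones."""
    full = score(train, test, features)
    reduced = score(train, test, [f for f in features if f not in removed])
    return full - reduced

# Hypothetical rows: label 1 means the startup raised another round.
train = [({"competition": 0.1, "age": 1.0}, 1), ({"competition": 0.2, "age": 3.0}, 1),
         ({"competition": 0.8, "age": 1.0}, 0), ({"competition": 0.9, "age": 3.0}, 0)]
test = [({"competition": 0.15, "age": 3.0}, 1), ({"competition": 0.85, "age": 1.0}, 0),
        ({"competition": 0.3, "age": 1.0}, 1), ({"competition": 0.7, "age": 3.0}, 0)]

features = ["competition", "age"]
drop_competition = removal_importance(train, test, features, {"competition"})
drop_age = removal_importance(train, test, features, {"age"})
print(drop_competition, drop_age)
```

In this toy data the label depends only on the competition feature, so removing it lowers accuracy while removing the age feature does not. With a real model, the same comparison would be repeated over several train/test splits to smooth out sampling noise.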
We retrieved general data on startups such as founded date, location, funding rounds and investors, as well as detailed information about the activity of the startups in the form of a description and associated keywords.
Our database then contains 618 366 companies, 221 299 investment rounds, 783 787 people and 6 363 831 news articles.
We have developed a tool which, given a text description of a startup, returns all of its competitors. We can then map them automatically, ordering them by industry, location or technology. Metadata is also incorporated through the size of the cells, which is proportional to the funding raised by the startup, and through color in the fashion of heatmaps, with heat representing either similarity to the reference description or the "vitality" of the startup, which is a proxy for the cash burn of the company.
Using our prediction model to predict which startups will raise another funding round, we show that competition is a crucial feature at the early stage, increasing accuracy by 6 points. However, this effect completely disappears in later rounds.
Contribution to Scholarship
In this paper, we show that competition is an important overlooked feature for predicting early stage venture rounds in startups, but not growth stage ones. This effect has not been observed yet to our knowledge and raises questions on the organization and dynamics of venture capital investing.
Contribution to Practice
We aim at releasing a tool that would help entrepreneurs, investors and policy makers alike to get better knowledge of their competitive ecosystem.
This increased knowledge might help actors circumvent the "competitor neglect" bias and improve the allocation of funding and energy.
Our paper presents a tool whose goal is to improve innovation strategies and deployment using data science techniques, in particular through competition and technology intelligence mapping.
 - Dalle, J. M., den Besten, M., & Menon, C. (2017). Using Crunchbase for economic and managerial research, OECD Working Paper.
 - Gastaud, C., Lacroix, T., Dion, G., Taub, R., & Dalle, J. M. (2018). Visualizing startup ecosystems and characterizing their diversity. In Proceedings of the R&D Management Conference 2018.
 - Owler Inc. Owler: Competitive Intelligence to Outsmart Your Competition [Internet]. 2019 February. Available from: https://www.owler.com/
 - Drieux P. FinTech Landscape [Internet]. 2018 January. Available from: https://www.vbprofiles.com/l/fintech
 - Strachman P. French AI Ecosystem [Internet]. 2018 January. Available from: http://franceisai.com/startups/
 - Bhamidipaty, A., Gruen, D., Kephart, J. O., Patel, S. S., Platz, J., Soroker, D., ... & Webb, A. (2018). Towards a Generalized Similarity Service.
 - Shi, Z., Lee, G. M., & Whinston, A. B. (2016). Toward a Better Measure of Business Proximity: Topic Modeling for Industry Intelligence. MIS Quarterly, 40(4).
 - Sharchilev, B., Roizner, M., Rumyantsev, A., Ozornin, D., Serdyukov, P., & de Rijke, M. (2018, October). Web-based Startup Success Prediction. In Proceedings of the 27th ACM. International Conference on Information and Knowledge Management (pp. 2283-2291). ACM.
 - Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
 - Arora, S., Liang, Y., & Ma, T. (2016). A simple but tough-to-beat baseline for sentence
Exploring disruptive innovation opportunity using patent analysis and deep learning
1Institute for Manufacturing(IfM), University of Cambridge, United Kingdom; 2Institute for Manufacturing(IfM), University of Cambridge, United Kingdom; 3Institute for Manufacturing(IfM), University of Cambridge, United Kingdom
Ex-ante prediction of disruptive technology has been challenging because retrospective analysis suffers from the “hindsight is always 20/20” problem and requires rich empirical data and historical cases. Relying on ex-post evaluation rather than ex-ante prediction of disruptive innovation has tended to produce outdated predictions.
There has been much debate on disruptive innovation since Christensen (1997) introduced the concept of disruptive technology (Danneels, 2004; Schuelke-Leech, 2018). The initial studies focused on technology itself, but the object of disruptiveness has expanded over time. Danneels (2004) proposed a notion of disruptive innovation, and Govindarajan and Kopalle (2006) distinguished disruptive innovation from radical but not disruptive innovation in terms of market segment, performance, target segment, and dilemma. With increasing interest in disruptive innovation, several studies have tried to evaluate and detect it. A variety of approaches have been suggested, from scoring models to economic and scenario models (Adner, 2002; Rafii and Kampas, 2002; Vojak and Chambers, 2004). They attempted to measure the disruptive potential of a new innovation by survey with a scoring model, or to simulate the potential entry and diffusion of disruptive innovations in existing markets from an economic perspective or with scenario models.
Although the concept of ‘disruptive innovation’ has received considerable attention, ambiguous definitions of the object of ‘disruptiveness’ remain prevalent. Questions have also been raised about the effectiveness of ex-post prediction, which makes it hard to identify opportunities in advance and to cope with disruptiveness at the right time.
What is the critical feature of disruptive innovation? How can disruptive innovation and technology be detected at an early stage? Once a disruptive technology has been found, how is it implemented as a product or service? Is it possible for artificial intelligence to assist in discovering disruptive innovation?
This paper presents analytical and quantitative research that mainly uses machine learning and deep learning with patent and opinion data to discover disruptive innovation. First, the concept of disruptive innovation is refined through a literature review. Second, the critical characteristics of disruptive innovation are defined, based on the refined concept, as features for learning and prediction. Third, potential disruptive technologies are extracted from the candidates by unsupervised learning on these features. Finally, disruptive innovation is derived by an LSTM algorithm that handles sequential data. After disruptive technology is identified among various technologies, disruptive innovation is discovered by content analysis using technical information and market requirements.
The main data sources are divided into technological and market data sources. Technological data sources comprise patent or paper databases that help to identify technological specifications or competitiveness, and market data sources include market reports and opinion data for identifying customers’ requirements as well as market intelligence.
The proposed approach is able to answer two main questions: what is disruptive technology, and how is this disruptive technology implemented as a product or service (and, furthermore, a business model)? First, among the candidate disruptive technologies, the technology with the greatest potential to be disruptive in the market is derived. It is classified by measuring the degree of disruptiveness based on features that represent disruptive innovation rather than on retrospective features. Second, the opportunity that becomes disruptive innovation can be discovered using the disruptive technology. In other words, it is possible to derive new ideas that can be implemented as products or services on the basis of the disruptive technology with the greatest potential. Deep learning is able to provide novel ideas in the form of words or combinations of keywords that represent new concepts of disruptive innovation.
Contribution to Scholarship
In order to explore disruptive innovation at an early stage, our study addresses the research questions by exploiting unsupervised learning with minimal dependence on past data. Supervised learning based on training with past data can be regarded as retrospective analysis, which may produce outdated results. Among the variety of techniques for unsupervised learning, deep learning, which has received much attention recently, will be used to detect disruptive innovation in advance. Although the intrinsic characteristics of each database differ, this approach considers both the technology and market perspectives. This results in opportunity discovery from the perspective of technological performance improvement as well as marketability. This study refined the concept and characteristics of disruptive innovation by reviewing the dispersed literature on disruptive technology and innovation. Based on the refined concept, the degree of disruptiveness can be measured more quantitatively and systematically by considering technological competitiveness and customers’ expectations.
Contribution to Practice
The proposed approach contributes to identifying new ideas that have the potential to invade the mainstream market and to enter niche markets at the right time. Ex-ante prediction enables early detection and timely reaction, which leads to profits while minimizing the losses caused by the innovator’s dilemma. Based on the proposed approach, practitioners can gain competitive advantage by responding to pre-detected opportunities (through concept generation or technology or product/service development) when competitors threaten their markets.
This paper is well suited to the themes of artificial intelligence and data science as well as radical and systemic innovation. Our study makes it possible to find disruptive innovation systematically through a data-driven approach that considers technological specifications together with market status and requirements.
Adner, R. (2002). When are technologies disruptive? A demand‐based view of the emergence of competition. Strategic Management Journal, 23(8), 667-688.
Christensen, C.M. (1997). The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail. Boston: Harvard Business School Press.
Danneels, E. (2004). Disruptive Technology Reconsidered: A Critique and Research Agenda. Journal of Product Innovation Management. 21. 246-258.
Dotsika, F., & Watkins, A. (2017). Identifying potentially disruptive trends by means of keyword network analysis. Technological Forecasting and Social Change. 119. 114-127.
Govindarajan, V., & Kopalle, P. K. (2006). Disruptiveness of innovations: measurement and an assessment of reliability and validity. Strategic Management Journal, 27(2), 189-199.
Rafii, F., & Kampas, P. J. (2002). How to identify your enemies before they destroy you. Harvard Business Review, 80(11), 115-23.
Schuelke-Leech, B-A. (2018). A model for understanding the orders of magnitude of disruptive technologies. Technological Forecasting and Social Change. 129. 261-274.