Conference Agenda

19-PM1-08: ST1.3 - Data Science for Innovation Challenges
Wednesday, 19/June/2019:
1:00pm - 2:30pm

Session Chair: Paola Belingheri, University of Pisa
Session Chair: Filippo Chiarello, University of Pisa
Session Chair: Antonella Martini, University of Pisa
Session Chair: Andrea Bonaccorsi, University of Pisa
Location: Amphi Lagarrigue

Session Abstract

The information field has changed dramatically over the past years, affecting the economy, technology, culture and society. However, these changes have left an even stronger mark on business systems (Jin 2015). Considering the mass of digital information produced in the past 10 years, companies have found themselves in a chaotic and constantly expanding digital universe. To innovate and stay competitive, companies must master methods and tools to prevent information overload, while gaining useful knowledge from the available data (Feng 2015).

The discipline of Data Science has emerged as a clear (although broad) field of research to solve data-related problems (Provost 2013). Data science is an interdisciplinary field that uses scientific methods to extract knowledge and insights from structured and unstructured data. It attracts researchers and encompasses methodologies from wide-ranging fields such as statistics, mathematics, information science, computer science, data analysis, machine learning and communication, and is therefore an ideal tool to bridge the gap between research, industry and society (Waller 2013).

The objective of the present track is to collect works that use state-of-the-art Data Science tools and techniques to gather, transform, model and visualize data (Wickham 2014) to gain valuable information relevant for firm innovation. The scope is to use publicly available data to obtain a clearer view of which information sources contain the most untapped value and which methods and tools can be used to uncover it.

The main contributions are expected to highlight which information is relevant for different companies to build knowledge as a tool for innovation, in particular related to:

- data science for product innovation (Chiarello 2018a; Tan 2015): e.g. data-driven product development, A/B testing, patent analysis, product success evaluation, machine learning for innovation.

- data science for technology intelligence (Colladon 2018; Chiarello 2018b): e.g. brand analysis, competitor mapping, partner identification, tools for knowledge visualization and communication.

- data science for open innovation & co-creation (Hoornaert 2017): e.g. paper mapping, open analytics, IP analysis, cloud computing.

- data science for new skill identification & mapping (Frey 2017): e.g. curricula analysis, job vacancy identification, job creation, new skills for innovation.

We expect to see contributions coming from the usual sources (e.g. open databases, patents, papers, social media) but we especially welcome contributions from less-known sources. Since Data Science is broad, we expect to showcase a wide range of methodologies such as machine learning, deep learning, natural language processing, image analysis or tools for data visualization and communication, to name a few.


Value Creation in Emerging Technologies through Sentiment Analysis of Scientific Papers: The Case of Blockchain

Filippo Chiarello1, Paola Belingheri1, Antonella Martini1, Andrea Bonaccorsi1, Andrea Fronzetti Colladon2

1Università di Pisa, Italy; 2Università di Perugia, Italy


Scientific papers contain a large quantity of valuable R&D information, which is hard to mine automatically because of its unstructured form. Through an innovative application of text mining, we are able to identify the advantages and disadvantages of specific technological solutions from the scientific literature, highlighting opportunities for value creation.


Value creation occurs when a process or an activity manages to raise the novel and appropriate benefits that users receive from its output (Lepak et al., 2007). It often results from the resolution of a technological problem that prevents a product or service either from performing or from being used to its full potential (Adner & Kapoor, 2010). These definitions all require an ex-post evaluation of the created value, whereas data mining techniques could represent a new approach.

Value creation processes include activities such as R&D, invention, and innovation (Lepak et al., 2007). Within emerging industries, key knowledge is disseminated through unstructured documents such as papers and is challenging to retrieve (Chen et al., 2012). Recently, sentiment analysis has been used to extract advantages (benefit, gain, profit) and disadvantages (drawback, failure) of technologies from patents (Chiarello, 2017). This methodology could also benefit emerging industries in identifying value creation opportunities.

Literature Gap

Literature on value creation has focused on defining processes of value generation (Lepak et al., 2007), locating challenges in the firm’s environment (Adner & Kapoor, 2010) and demand-side perspectives (Priem, 2007); there is thus a gap in the ex-ante identification of the most promising avenues for value creation.

Research Questions

Using the perspective of value creation, this paper will develop a new methodology to list the problems of a technological field and the solutions currently being pursued, correlate the problems with one another and map the research agenda. In doing so we will identify key attributes of value creation within emerging technologies.


The approach we adopted is new with respect to those established in the literature (i.e. bibliometric analysis), which do not allow a deep understanding of the texts’ technical content.

To compute technical sentiment, we used a lexicon developed by the authors that extracts advantages and disadvantages of inventions from patents (Chiarello, 2017) and a novel dictionary lookup approach that incorporates weighting for valence shifters. Following a bottom-up approach, we redesigned and updated these lexicons for an optimal application to scientific papers.

Finally, the methodology was applied to a case study consisting of a set of scientific papers on blockchain technology.
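As an illustration of the dictionary lookup idea, the following is a minimal sketch and not the authors' implementation. The word lists, the weights and the three-token window are all assumptions made for the example, and the input is assumed to be already lemmatized so that exact matching is enough.

```python
import re

# Hypothetical mini-lexicons; the authors' actual lexicons are not reproduced here.
ADVANTAGE_TERMS = {"benefit": 1.0, "gain": 1.0, "profit": 1.0}
DISADVANTAGE_TERMS = {"drawback": -1.0, "failure": -1.0}
LEXICON = {**ADVANTAGE_TERMS, **DISADVANTAGE_TERMS}

# Valence shifters (assumed lists): negators flip the sign,
# amplifiers and de-amplifiers rescale the weight.
NEGATORS = {"no", "not", "without"}
AMPLIFIERS = {"major": 1.5, "significant": 1.5}
DEAMPLIFIERS = {"minor": 0.5, "slight": 0.5}

WINDOW = 3  # assumed number of preceding tokens inspected for shifters


def technical_sentiment(sentence):
    """Return a signed score: > 0 suggests an advantage, < 0 a disadvantage."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        weight = LEXICON[token]
        # Apply every shifter found in the window before the lexicon term.
        for shifter in tokens[max(0, i - WINDOW):i]:
            if shifter in NEGATORS:
                weight = -weight
            elif shifter in AMPLIFIERS:
                weight *= AMPLIFIERS[shifter]
            elif shifter in DEAMPLIFIERS:
                weight *= DEAMPLIFIERS[shifter]
        score += weight
    return score
```

With these assumed lists, a sentence such as "The protocol has a major drawback in latency" receives a negative score, while "The scheme works without failure" is flipped to a positive one by the negator.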

Empirical Material

Our dataset was composed of 1276 papers on the topic of blockchain extracted from Scopus in May 2018, through the query: TITLE-ABS-KEY ( "blockchain" OR "block chain" OR "block-chain"). The data was then cleaned to account for the usage of the term blockchain and its morphological variations in fields other than computer science and economics (e.g. chemistry (Dong 2013)). To increase the dataset’s precision, we also filtered using rules based on the content of the abstracts, checking for a blacklist of generic words contained in scientific articles. This resulted in a dictionary of words that was used to search for and filter the papers whose abstracts contain one or more of the words in the technical sentiment lexicon.

The texts were then pre-processed using state-of-the-art natural language processing tools (sentence splitter, tokenizer, lemmatizer), which resulted in a total of 8143 sentences to be analyzed through our sentiment analysis algorithm.
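The lexicon-based filtering step could look roughly like the sketch below. It assumes that "filtering" means keeping the papers whose abstract mentions at least one lexicon word. The record layout (`id`, `abstract`) and the small lexicon are hypothetical and stand in for the Scopus export and the authors' lexicon.

```python
import re

# Hypothetical stand-in for the technical sentiment lexicon.
SENTIMENT_LEXICON = {"benefit", "gain", "profit", "drawback", "failure"}


def mentions_lexicon_term(abstract, lexicon=SENTIMENT_LEXICON):
    """True if the abstract contains at least one lexicon word."""
    words = set(re.findall(r"[a-z]+", abstract.lower()))
    return not words.isdisjoint(lexicon)


def filter_records(records, lexicon=SENTIMENT_LEXICON):
    """Keep only records whose abstract mentions a lexicon term (assumed rule)."""
    return [r for r in records if mentions_lexicon_term(r["abstract"], lexicon)]
```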


The Natural Language Modelling process has been applied in a specific case study on blockchain technology. We chose this technology because it is highly innovative, it is having an impact in many different fields (so it is interesting to see whether the automatic system can capture all of them) and its problems are both technical and managerial.

The results of the case study will be: i) a list of the problems of the technological field of blockchain, ii) links between problems and disciplinary fields, iii) an analysis of the dynamics of the problems over time and iv) correlations among the problems.

Although the research is still ongoing, we have started to explore the preliminary outputs of the system. To navigate the output we developed a graph of problems, where a node represents a word contained in the sentences describing the problems and an edge exists if two words appear more than 2 times in the same abstract. The size of a node is proportional to its degree and the color of a node represents its cluster (automatically computed). A preliminary navigable version of the graph is available following the link
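The construction of such a graph can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that an edge is created when two words co-occur in more than two abstracts, and the function and variable names are hypothetical. Clustering and rendering are left out.

```python
from collections import Counter
from itertools import combinations

MIN_COOCCURRENCES = 2  # an edge needs strictly more than this many shared abstracts


def build_problem_graph(abstract_words, threshold=MIN_COOCCURRENCES):
    """Build a word co-occurrence graph.

    abstract_words: one set of problem words per abstract.
    Returns (edges, degree): edges maps a sorted word pair to its
    co-occurrence count; degree maps each connected word to its edge count.
    """
    pair_counts = Counter()
    for words in abstract_words:
        for pair in combinations(sorted(words), 2):
            pair_counts[pair] += 1
    # Keep only the pairs that co-occur often enough.
    edges = {pair: n for pair, n in pair_counts.items() if n > threshold}
    degree = Counter()
    for first, second in edges:
        degree[first] += 1
        degree[second] += 1
    return edges, degree
```

The `degree` values can then be used directly to scale node sizes in whatever visualization tool is chosen.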

Contribution to Scholarship

The first main contribution of our work will be to give scholars a tool that can clearly map the problems to be solved within a technological field. This map can be used to identify the research directions that are currently most likely to have an impact on society.

Secondly, we will contribute to the field of innovation science, since the tool can be used ex-post to map the evolution of the problems to be addressed (or already addressed) in a field, thus giving a new point of view on the trajectories of innovations and indicating the ones that have the most potential for value creation.

Lastly, the theoretical contribution of our work will also impact Natural Language Processing and Text Mining fields since we present an innovative method of information extraction from unstructured sources.

Contribution to Practice

This output, and the others expected from the proposed methodology, can serve a wide range of users:

1- Companies that want to rapidly map a certain technological field

2- Policymakers who want to invest in solving the problems of a technology to boost its innovation

3- Editors who want to understand their positioning

4- Research networks to better define their strategy and to exploit possible collaborations and synergies between scholars.

To maximize the contribution to practice of the present work, we plan to develop an R (statistical software) package that can boost the reproducibility of the developed method.


This research seeks to bridge the gap between research and industry, by developing and demonstrating a tool that will systematically analyze scientific publications in emerging technology to identify the most promising avenues for value creation. Thereby our work will also contribute to creating societal value.


Adner, R., & Kapoor, R. (2010). Value creation in innovation ecosystems: How the structure of technological interdependence affects firm performance in new technology generations. Strategic management journal, 31(3), 306-333.

Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly: Management Information Systems, 36(4), 1165-1188.

Chiarello, F., Fantoni, G., & Bonaccorsi, A. (2017). Product description in terms of advantages and drawbacks: Exploiting patent information in novel ways. ICED, 101-110.

Ji, X., Dong, S., Wei, P., Xia, D., & Huang, F. (2013). A Novel Diblock Copolymer with a Supramolecular Polymer Block and a Traditional Polymer Block: Preparation, Controllable Self‐Assembly in Water, and Application in Controlled Release. Advanced Materials, 25(40), 5725-5729.

Lepak, D. P., Smith, K. G., & Taylor, M. S. (2007). Value creation and value capture: a multilevel perspective. Academy of management review, 32(1), 180-194.

Towards Automatic building of Human-Machine Conversational System to support Maintenance Processes

Elena Coli1, Nicola Melluso2, Gualtiero Fantoni2, Daniele Mazzei3

1Department of Information Engineering, University of Pisa, Italy; 2Department of Civil and Industrial Engineering, University of Pisa; 3Department of Computer Science, University of Pisa


The Industry 4.0 paradigm is introducing many cognitive changes that companies have to deal with. In this constantly changing environment, knowledge management is a key factor. As a consequence, technical documents become fundamental within companies, as they can be sources from which new knowledge is generated automatically.


The Acatech study on the new “Industrie 4.0” [1] recognizes knowledge management as one of the missing building blocks of the Fourth Industrial Revolution.

Currently, the knowledge management model consists of six core phases [2]: generate, refine, store, transfer, share and use knowledge.

One of the main applications of technical texts is the development of dialogue systems, machines able to hold a conversation with another agent or with a human [3]. The development of these systems involves applying the knowledge management model. However, its abstraction and generality make it difficult to apply to real cases. This is why dialogue systems are currently built following the retrieval-based model, using a repository of pre-defined responses [4], and written manually, first identifying the needed functions and then planning the interactions with users. This process, besides being time-consuming, cannot be replicated across different cases.

Literature Gap

Conversely, a conversational system, also referred to as a chatbot, can be built from scratch by simply extracting rules from technical documentation, using a knowledge-based approach. These rules are then fed to an engine (expert system) suitable for managing the chatbot. This makes the chatbot construction fast and cost-efficient.

Research Questions

The research goal is to automatically build the answer system of a chatbot, the part of the chatbot able to reply to the users, using text mining techniques. The system will interact with operators, supporting them and enabling them to carry out maintenance operations.


As a first step, the maintenance manual was pre-processed using text mining techniques. Meanwhile, lists of concepts (groups of entities, verbs or logical relations) expected to be found in the manual were generated. These lists were then automatically expanded. Once enough concepts for an automatic analysis had been created, the regular expressions for each type of concept were extracted from the manual. Regular expressions are sequences of characters that define a search pattern [5]. They allowed us to automatically identify other concepts in the manual and therefore to expand the concept lists.
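The expansion step can be illustrated with a small sketch. It is only an assumption about how such patterns might look: the seed verbs, the sample sentences and the "verb + the + one or two words" pattern are hypothetical, and a fixed pattern like this is much cruder than rules derived from a real manual.

```python
import re

# Hypothetical seed list and manual excerpt; the BOBST manual is not reproduced here.
SEED_VERBS = ["remove", "replace", "clean"]
MANUAL_TEXT = (
    "Remove the doctor blade before cleaning. "
    "Replace the ink pump if the flow drops. "
    "Clean the anilox roller with solvent."
)


def build_entity_pattern(verbs):
    """Compile a pattern matching '<verb> the <one or two words>'."""
    alternatives = "|".join(re.escape(v) for v in verbs)
    return re.compile(
        rf"\b(?:{alternatives})\s+the\s+([a-z]+(?:\s+[a-z]+)?)",
        re.IGNORECASE,
    )


def expand_entities(text, verbs, known_entities):
    """Add entity candidates found after a seed verb to the known list."""
    pattern = build_entity_pattern(verbs)
    found = {match.group(1).lower() for match in pattern.finditer(text)}
    return sorted(set(known_entities) | found)
```

Calling `expand_entities(MANUAL_TEXT, SEED_VERBS, ["ink pump"])` adds the two other components mentioned in the excerpt to the entity list. In practice the candidates would need manual or statistical validation before being accepted.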

Empirical Material

The methodology was applied to a maintenance manual provided by BOBST, a global company that produces presses for rotogravure printing and coating and laminating machines for the flexible materials industry. The manual describes the maintenance operations of a flexographic printing machine for labels and flexible packaging.

In order to expand the lists of concepts following the methodology described in the previous section, different data sources were used. Among these, we can mention functional verbs [6], Wikipedia and WordNet.


The project allowed the automatic building of a knowledge base for a chatbot by applying text mining techniques. The knowledge base will be used to structure a first version of a chatbot answer system for BOBST.

Although the results make us confident that the methodology can be improved further, this is only the first step of a long journey. In this work, the new methodology was applied to a well-formalized text, since BOBST is a multinational company with particularly advanced technical documentation. For this reason, future experiments with different maintenance manuals will provide us with the material to evaluate the need for improvements, both in the pre-processing of the text and in the generation and expansion of concepts.

The next critical step will be building the question system, the part of the chatbot able to understand the questions of the users. The integration of the answer and question systems will constitute the first prototype of the chatbot.

Contribution to Scholarship

The contribution to scholarship lies mainly in the application of text mining techniques to companies’ technical documentation. Moreover, in this work, text mining techniques have been applied to unstructured texts, which do not have a precise scheme or format: for this reason, their pre-processing and processing are more difficult.

The development of the methodology paves the way for the development of many other research projects.

Contribution to Practice

The contribution to practice is clear and relevant, since the work has been carried out with the commitment of a multinational company like BOBST. The contribution can be seen from two sides: firstly, the goal of the research is driven by questions and challenges coming from practical needs; secondly, the methodology is designed to be applied to a real case and to be potentially extended and re-used. The project also has strong value from the managerial point of view, as it allows both the achievement of an objective and the optimization of the necessary resources.


This research is strongly linked to this year’s conference theme. In fact, since chatbots support knowledge management, their automatic development, performed with text mining techniques, is a relevant issue for companies in order to face innovation.


[1] Schuh, G., Anderl, R., Gausemeier J., ten Hompel, M., Wahlster, W. (2017). Industrie 4.0 Maturity Index. Managing the Digital Transformation of Companies (acatech STUDY), Munich: Herbert Utz Verlag.

[2] King, W. R. (2009). Knowledge management and organizational learning.

[3] Masche, Julia & Le, Nguyen-Thinh. (2018). A Review of Technologies for Conversational Systems. doi: 10.1007/978-3-319-61911-8_19.

[4] Ramesh, K., Ravishankaran, S., Joshi, A., & Chandrasekaran, K. (2017). A survey of design techniques for conversational agents. doi:10.1007/978-981-10-6544-6_31.

[5] Ruslan Mitkov (2003). The Oxford Handbook of Computational Linguistics.

[6] Fantoni G., Apreda R., Bonaccorsi A. (2009). Functional vector space.

Predicting the firms' innovation performance with patents and machine learning algorithms

Raffaella Manzini1, Gloria Puliga1, Linda Ponta1, Luca Oneto2

1LIUC Università Cattaneo, Italy; 2Università di Pisa, Italy


Innovation Capability (IC) can be defined as the firm's ability to acquire and valorize interior experiences, mobilize and create new knowledge that will result in product and/or process innovation able to satisfy present and future market needs, through the use of different innovative, tacit, non-modifiable and closely assets.


Several studies are investigating the determinants of IC, in order to understand whether and how they can be measured and used to forecast the future innovation capability of companies.

IC determinants are grouped into internal and external factors (Romijn and Albaladejo, 2002). The former refers to the skills brought into the firm by the workforce, the latter to the interactions of firms with other companies or scientific actors that complement internal knowledge.

As a measure of IC, several studies use the number of patents (Stern and Porter, 2000; Manzini and Lazzarotti, 2016) and the patent forward citations (Lee et al., 2018; Bessen, 2008). Forward citations are a suitable indicator of IC, as they reflect the novelty, quality and market acceptance of an innovation.

So far, mainly qualitative methods or linear regressions have been used to investigate the relation between firms’ IC and its determinants (Romijn and Albaladejo, 2002).

Literature Gap

Recently, increasingly complex methods and new sources have been proposed to overcome contrasting results about how IC determinants influence IC, but further investigation is still needed (Love and Roper, 2015; Kim et al., 2018). In particular, the use of machine learning algorithms is still in an emerging state.

Research Questions

The aim of the paper is to show how the machine learning algorithms most widely used in real-world applications, i.e. Regularized Least Squares, Deep Neural Networks and Random Forest decision trees, can provide a relevant contribution to the measurement and forecasting of the IC of firms.


Thanks to the large availability of patent data, a machine learning approach is proposed. Patents are used as a source of innovation data: patent forward citations as a proxy of IC, and patent features (e.g., the number of backward citations, the number and the type of technological classes etc.) as proxies of IC determinants. Three different machine learning algorithms, Regularized Least Squares (RLS), Deep Neural Networks (DNN) and Random Forest decision trees (RF), are employed to capture the relationships between input features (IC determinants) and output features (IC).
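To make the simplest of the three algorithms concrete, here is a minimal sketch of Regularized Least Squares in plain Python, solving (XᵀX + λI)w = Xᵀy directly. It is not the authors' implementation: the feature layout (for instance, one column for backward citations and one for the number of technological classes), the absence of an intercept and the value of `lam` are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_rls(features, targets, lam=1.0):
    """Return weights w minimising ||Xw - y||^2 + lam * ||w||^2."""
    d = len(features[0])
    gram = [[sum(row[i] * row[j] for row in features) for j in range(d)]
            for i in range(d)]
    for i in range(d):
        gram[i][i] += lam  # ridge penalty on the diagonal
    xty = [sum(row[i] * y for row, y in zip(features, targets))
           for i in range(d)]
    return solve_linear_system(gram, xty)


def predict(weights, row):
    """Predicted target (e.g. forward citations) for one feature row."""
    return sum(w * v for w, v in zip(weights, row))
```

A larger `lam` shrinks the weights towards zero, trading some fit on the training data for robustness to noisy features; in practice its value would be chosen by validation.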

Empirical Material

The Orbit Intelligence database, provided by Questel, one of the world’s leading intellectual property management companies, is used as the data source. Patents are collected for the EU area from 2005 to 2018, for a total of 1,635,734 patent families and 8,093,138 patents. The patents considered are divided by country according to where the firm has its registered office or where the inventor resides.


All three algorithms adopted have a low error, and the best-performing ones are Deep Neural Networks and Random Forest decision trees. Since Random Forest is the most robust and the easiest algorithm to train, RF has been chosen as the tool to extract the most significant features (Wainberg et al., 2016). For the countries analysed, the same patent features have been extracted as important in predicting the forward citations and thus in estimating the IC. In particular, the most relevant features are the technological classes (based on the IPC), the technological domain and the number of backward citations. This result shows that the most important determinants for predicting IC are mainly internal. Other features, such as the number of words or the number of independent claims, are not important in determining IC.

Contribution to Scholarship

The paper contributes to the study of how IC determinants influence IC. Compared to more qualitative studies, machine learning approaches make it possible to generalize the findings and apply them to several contexts. With respect to classical regression models, they make it possible to map cause-effect relationships, to use wider, noisier and more complex data sets and to model the relationships by combining linear and non-linear functions in the same model. Results are also better in terms of accuracy and interpretability. As far as accuracy is concerned, a machine learning approach makes it possible to compare different algorithms and to choose the one with the best performance. As far as interpretability is concerned, this approach makes it possible to understand the contribution of the various IC determinants to the IC.

Contribution to Practice

From a managerial point of view, this paper provides managers with some insights on the variables they can manage in order to improve their IC. Results show the great relevance of internal determinants in predicting IC, especially technological classes and backward citations. Companies must therefore be aware that their capacity to absorb previous knowledge and make use of it will strongly affect their future innovation ability (Harhoff et al., 2003; Hall et al., 2007). Some external determinants are also relevant, especially the size of the patent family and the continuity of investment. Time-related variables also matter.


This contribution deals with the application of machine learning approaches to improve research in the field of innovation management and to face relevant challenges in the management of innovation.


Bessen J. (2008). The value of US patents by owner and patent characteristics, Research Policy, 37, 5, 932–945.

Hall, B. H., Thoma, G., and Torrisi, S. (2007). The market value of patents and R&D: evidence from European firms. Academy of Management Proceedings, 1, 1–6.

Harhoff, D., Scherer, F. M., and Vopel, K. (2003). Citations, family size, opposition and the value of patent rights. Research Policy, 32, 8, 1343–1363.

Kim, M.-K., Park, J.-H. and Paik, J.-H. (2018). Factors influencing innovation capability of small and medium-sized enterprises in Korean manufacturing sector: facilitators, barriers and moderators. International Journal of Technology Management, 76, 3-4, 214–235.

Lee C., Kwon O., Kim M. and Kwon D. (2018). Early identification of emerging technologies: A machine learning approach using multiple patent indicators, Technological Forecasting and Social Change, 127, 291–303.

Love, J. H. and Roper, S. (2015). SME innovation, exporting and growth: A review of existing evidence. International Small Business Journal, 33, 1, 28–48.

Manzini R. and Lazzarotti V. (2016). Intellectual property protection mechanisms in collaborative new product development, R&D Management, 46, 579–595.

Romijn H. and Albaladejo M. (2002). Determinants of innovation capability in small electronics and software firms in Southeast England, Research Policy, vol. 31, no. 7, 1053–1067.

Stern S., Porter M. E. and Furman J. L. (2000). The determinants of national innovative capacity, National Bureau of Economic Research, Tech. Rep.

Wainberg, M., Alipanahi, B. and Frey, B. J. (2016). Are random forests truly the best classifiers? The Journal of Machine Learning Research, 17, 1, 3837–3841.

Perspectives for Social Robotics: Towards RoboEthics and Therapeutic Applications

Luis Daniel Bolanos1, Filippo Chiarello2, Daniele Mazzei3

1Department of Information Engineering, University of Pisa; 2Dept. of Energy, Process and System Engineering, University of Pisa; 3Computer Science Department, University of Pisa


Social Robotics (SR) has been consolidating itself as a field, and its fast growth has not been unidirectional but has moved towards a multidisciplinary scenario that needs actors from different sectors of society. This rapid expansion calls for a clear and instructive roadmap, design guidelines and a perspective on the field.


Social robots have been classified based on their level of engagement with the user (Fong, Nourbakhsh, & Dautenhahn, 2003) and based on their application field (Goodrich & Schultz, 2007; Leite, Martinho, & Paiva, 2013). There has also been extensive discussion of the most urgent traits for robots to have and, in general, of the direction in which the field should aim (Fong et al., 2003; Goodrich & Schultz, 2007; Leite et al., 2013; Paiva, Leite, Boukricha, & Wachsmuth, 2017). To the best of our knowledge, only one study has performed a bibliometric analysis of the field of SR, based on topological clusters connected by citation networks (Mejia & Kajikawa, 2017). Eight topics were found for the field of SR, namely robots as social partners, ergonomics in HRI, robots for child development, swarm robotics, emotion detection, assessment of surgical robots, robots for the elderly, and rescue robots.

Literature Gap

Even though efforts have been made on identifying relevant topics and their proportion inside the field, no information about their time course exists, meaning that it is not objectively known which ones are on the rise and which ones are losing popularity.

Research Questions

Which topics inside the field of SR have been trending, and which ones have been losing popularity, in the last 10 years?


A quantitative approach based on topic modelling (TM) of the articles in the field fits the requirements naturally, especially because the a-posteriori distributions of the topics over the corpus give an indication of topic size and relevance. Here, a Latent Dirichlet Allocation (LDA) model was fitted to a corpus of abstracts. Semantic coherence was estimated for the topics, and linear trends over the last 10 years were computed.
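The trend estimation can be sketched as an ordinary least-squares fit of a topic's yearly share against the publication year. This is an illustrative sketch only: it assumes that the yearly shares have already been derived from the topic model, the function name is hypothetical, and the significance test behind the reported p-values is not shown.

```python
def linear_trend(years, shares):
    """Ordinary least-squares slope and intercept of shares against years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(shares) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, shares))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A positive slope indicates a topic whose share of publications is growing, a negative slope one that is losing ground.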

Empirical Material

11365 abstracts were gathered on December 12th 2018 from SCOPUS through the query "social robot*".


A 70-topic LDA model was fitted to the corpus after 70 was found to be the optimal number of topics. Positive trends in the last 10 years concerned RoboEthics, rising from 14.3% to 16.5% of all publications (p<0.05), and robots for patients with Autism Spectrum Disorder (ASD) (14.3% to 15%, p<0.05), while negative trends concerned Swarm Robotics (15.2% to 14.8%, p<0.05) and contingent robots (14.9% to 14.3%, p<0.05). Furthermore, the subset of articles pertaining to social robotics and ASD was also analysed individually. Nao is by far the most used robot in the field of SR. However, it is used only 17% of the time in ASD treatment research. By contrast, Zeno, Kaspar and Keepon are less popular overall, but are used in ASD research 55%, 54% and 60% of the time, respectively. When an LDA model was fitted to this subset, research using Nao was found to be trending positively (p < 0.05), rising from 4.7% to 5.2%, while research using Kaspar declined from 6.2% to 5%.

Contribution to Scholarship

This work will provide another holistic picture of the field, so that academics can identify where it is best to invest resources among trending topics and detect missing or underdeveloped topics that are worth researching.

Contribution to Practice

Outside academia, the work may inspire and empower entrepreneurs to coordinate efforts and to create strong cooperation agreements between industry and academia, in order to produce high-quality products that match both market needs and scientific standards. Additionally, this explorative method could be transferred to other research fields to exploit large corpora of data.


SR is experiencing exponential growth and is a generator of innovative ideas that are spreading not only through academia but also through the market, as evidenced lately by the number of start-ups offering social companions and assistants that are shaping the way people live and creating new market opportunities.


Fong, T., Nourbakhsh, I., & Dautenhahn, K. (2003). A survey of socially interactive robots. Robotics and Autonomous Systems, 42, 143–166.

Goodrich, M. A., & Schultz, A. C. (2007). Human-Robot Interaction: A Survey. Foundations and Trends® in Human-Computer Interaction, 1(3), 203–275.

Leite, I., Martinho, C., & Paiva, A. (2013). Social Robots for Long-Term Interaction: A Survey. Int J Soc Robot, 5, 291–308.

Mejia, C., & Kajikawa, Y. (2017). Bibliometric Analysis of Social Robotics Research: Identifying Research Trends and Knowledgebase. Applied Sciences, 7(12), 1316.

Paiva, A., Leite, I., Boukricha, H., & Wachsmuth, I. (2017). Empathy in Virtual Agents and Robots. ACM Transactions on Interactive Intelligent Systems, 7(3), 1–40.