Data Scientist Roadmap 2025

Published: Jan. 8, 2025, 9:13 a.m. (UTC) / Updated: Jan. 8, 2025, 9:27 a.m. (UTC)

Unraveling the Evolution of Data Scientists Through Technological and Societal Changes

Introduction

In this document, we will first look back at the history of how the roles and required skills of data scientists have evolved alongside technological advancements. We will then explore the anticipated changes in the role of data scientists over the next decade. Against the backdrop of technological and societal changes, we will analyze the step-by-step evolution of the data scientist's role—from automating simple manual tasks, through advanced decision-making support, to managing AI systems across society. Furthermore, we will examine the impact of these changes on society and the new skills that data scientists will need to acquire.

0. Step 0: The Birth of Data Scientists

Data analysis itself has a long history. For this document, we define Step 0 as the period from the emergence of the "data scientist" profession to the rise of deep learning. This section will summarize the historical background that led to the birth of the data scientist role, the problems it was expected to solve, and the skill sets required at that time.

1. Historical Background

  • The Arrival of the Big Data Era: A massive amount of data began to be generated and accumulated. The big data era emerged as a result of the simultaneous evolution of technology, explosive growth in data generation, and an increased recognition of data's value. Between 2008 and 2012, the concept of big data became widely recognized, leading to advancements in technology and infrastructure. During this period, big data evolved from being a mere buzzword to becoming an essential element of business and scientific research. Simultaneously, the demand for cloud computing gave rise to services like Amazon EC2, Amazon Redshift, Google App Engine, and Google BigQuery.

  • Improved Computing Power: Enabled more complex and large-scale analyses. In particular, advances in NVIDIA's CUDA platform (from 2006 onward) drove GPU adoption, and GPU parallelism combined with distributed computing made massively parallel processing practical. This made it possible to analyze terabyte- and petabyte-scale data, train and run inference with deep learning models, and learn language models from large-scale corpora.

  • Advancements in Machine Learning Algorithms: Increased the importance of predictive modeling.

    • From decision trees to random forests: In 2001, Leo Breiman proposed random forests, an ensemble method that is more accurate and less prone to overfitting than single decision trees.
    • The evolution of support vector machines (SVMs): From the late 1990s to the 2000s, advancements in kernel methods enabled nonlinear classification. SVMs gained recognition for their effectiveness in high-dimensional data and were widely adopted across various fields.
    • The emergence of gradient boosting: In 2001, Jerome H. Friedman proposed gradient boosting machines. Notably, XGBoost (open-sourced in 2014, with its paper published in 2016) delivered even higher accuracy and faster training and prediction. A brief sketch comparing these ensemble methods appears just after this list.
    • The resurgence of neural networks: In 2006, Hinton et al. proposed Deep Belief Networks (DBNs), laying the foundation for modern deep learning. This work automated feature learning and provided an effective method for training deep networks, revolutionizing data science practices. DBNs' success helped revive interest in Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and related architectures, forming the cornerstone of the modern AI boom. It also set the stage for AlexNet, developed in 2012 by Krizhevsky, Sutskever, and Hinton, which marks the beginning of Step 1.
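
To make the contrast concrete, here is a rough sketch comparing the two ensemble methods named above, assuming scikit-learn; the synthetic dataset and hyperparameters are illustrative, not drawn from the cited papers.

```python
# A rough sketch comparing random forests and gradient boosting on
# synthetic data; assumes scikit-learn is installed.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```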

2. The Emergence of the Data Scientist Profession

In 2008, D.J. Patil and Jeff Hammerbacher first coined the term "data scientist" while working at LinkedIn and Facebook, respectively, to describe a new role focusing on large-scale data analysis. Notably, the term "data scientist" was chosen partly for its appeal in the context of recruitment.

3. The Role of Data Scientists

When the term "data scientist" was first coined, analytical techniques and statistics were already well developed, and mastering these concepts and technologies demanded considerable effort; this made it natural for data scientists to become established as a distinct profession. At the time, the lack of comprehensive open-source libraries, compared with today, made implementation more challenging. In this context, data scientists conducted analyses that combined statistics, computer science, and domain knowledge specific to various fields (e.g., image processing, natural language processing). Despite the "big data" narrative, relatively few companies had sufficient data for meaningful analysis, and the roles of "data scientist" and "machine learning engineer" were not yet clearly differentiated. Within this framework, the roles of data scientists during that period included the following:

  • Transforming Business Challenges into Data-Driven Solutions
    For example, data scientists addressed various industry-specific challenges:

    1. Customer churn prevention in retail: Analyzing purchase history, website behavior data, and customer service interactions to identify at-risk customers and implement retention campaigns. A well-known example is churn analysis in the telecommunications industry.
    2. Reducing defect rates in manufacturing: Collecting and analyzing sensor data, production line settings, and environmental data to identify factors affecting quality variation. Predictive models were developed to provide real-time recommendations for optimal production settings.
    3. Fraud detection in financial services: Analyzing transaction data, customer profiles, and location data to develop fraud detection models. Real-time scoring systems flagged suspicious transactions immediately, helping to prevent losses due to credit card fraud.
    4. Improving campaign ROI in marketing: Using past campaign data, customer segmentation, and response rates to build predictive models. These models optimized campaign types and response predictions for each customer segment.
    5. Optimizing delivery routes in logistics: Integrating GPS data, traffic information, weather data, and order history to calculate efficient daily delivery plans using optimization algorithms.
  • Computer Science Skills:

    • Examples of key modeling and analysis techniques:

    • Feature engineering:

      • Data scientists manually designed features using domain knowledge (a brief sketch of this style of feature design appears after this list).
        • Example 1: For image recognition tasks, they manually extracted features such as edge detection, texture analysis, and color histograms.
        • Example 2: For natural language processing, they manually designed features such as TF-IDF, n-grams, and part-of-speech tags.
    • Expected approaches:

      • Problem-solving through statistical machine learning
      • Translating big data within companies into mathematical problems
      • Fusing domain knowledge with mathematical science to drive business impact
  • Engineering Skills:

    • Mastery of distributed processing techniques and GPU programming: Technologies like Hadoop (an open-source framework enabling distributed processing of large datasets) and MapReduce (a distributed processing programming model proposed by Google) were essential for dividing large datasets into smaller chunks and processing them in parallel.
    • Proficiency in cloud computing environments
    • Algorithm design for large-scale data processing
    • Skills in optimizing system architecture
    • Effective use of cost-efficient computing resources
    • Managing hybrid environments of on-premises and cloud systems
  • Ethical Skills:

    • Addressing privacy issues associated with large-scale data processing
    • Ensuring fairness and transparency in models
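
As a brief illustration of the manual feature design described under Computer Science Skills above, here is a minimal sketch assuming scikit-learn; the two-document corpus is invented for illustration.

```python
# A minimal sketch of Step-0-style manual feature design for text,
# assuming scikit-learn; the corpus is invented.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the customer cancelled the subscription after one month",
    "the customer renewed the subscription early",
]
# Hand-chosen feature recipe: unigrams and bigrams weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(X.shape)                                   # documents x features
print(vectorizer.get_feature_names_out()[:5])    # the designed features
```

The point is that the feature recipe (which n-grams, which weighting) is chosen by the analyst, not learned by the model; this is exactly what deep learning later automated.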

1. Step 1: The Rise of Deep Learning (Early 2010s)

Before reaching the current phase of AI/machine learning development (Step 2), there was a significant turning point in the field of data science. The rapid advancement of deep learning, particularly starting around 2012, brought revolutionary changes to data science and AI. This period is referred to here as "Step 1."

1.1 Background of Step 1: Major Changes Brought by the Rise of Deep Learning

  1. Breakthroughs in Image Recognition:

    • AlexNet’s Victory in the 2012 ImageNet Competition:
      Before AlexNet, traditional machine learning methods relied on manually crafted features, which struggled with complex visual tasks. AlexNet's architecture, featuring multiple convolutional layers, demonstrated the ability of deep neural networks to automatically learn complex hierarchical representations of data. This breakthrough enabled more accurate and efficient image recognition by extracting meaningful features directly from raw data using deeper architectures.
    • Scalability and Generalization Performance:
      AlexNet showcased the scalability and generalization power of deep neural networks. Despite the substantial increase in model complexity and the number of parameters, AlexNet demonstrated remarkable generalization abilities, effectively recognizing objects across diverse categories and perspectives in images.
    • Key Breakthroughs in AlexNet:
      • Effective Use of ReLU: Although ReLU was already known, AlexNet was the first to use it successfully in a large-scale CNN. ReLU mitigated the vanishing gradient problem, making much deeper architectures trainable, and its computational simplicity contributed significantly to faster training.
      • Dropout Regularization: Introduced by Hinton and colleagues, this new technique significantly reduced overfitting and improved model generalization. The sparsity induced by ReLU complemented dropout regularization, leading to synergistic effects.
      • GPU Utilization: Hinton's team leveraged GPUs for parallel processing, drastically reducing training time.
      • Automated Feature Extraction and Learning from Massive Data: These advancements enabled the network to directly learn features from over 1.2 million images in ImageNet without manual feature engineering. Data augmentation further bolstered the model's learning capabilities, marking a major breakthrough at the time.
  2. Advancements in Natural Language Processing:

    • Introduction of Word Embedding Techniques like Word2Vec:
      The emergence of word embedding techniques, such as Word2Vec, began in the early 2010s (notably in 2013) and rapidly evolved. These techniques revolutionized natural language processing (NLP), significantly altering the role and required skill set of data scientists. The continuous development of embedding techniques has consistently improved the performance of NLP tasks.
    • RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory):
      These architectures improved sequence data processing by capturing long-term dependencies and contextual information, leading to transformative advances in tasks like machine translation, sentiment analysis, and text generation.
  3. Emergence of Highly Generalizable Architectures:

    • Application of Convolutional Neural Networks (CNNs) to Non-Image Tasks:
      CNNs were applied to diverse fields beyond image processing, including natural language processing, audio processing, time-series analysis, and genome analysis.
    • Encoder-Decoder Models for Unified Task Handling:
      Encoder-decoder models facilitated a unified approach to various tasks:
      • Machine Translation: Input as a sentence in the source language and output as a sentence in the target language.
      • Text Summarization: Input as a long text and output as a short summarized text.
      • Speech Recognition: Input as audio signals and output as text strings.
      • Image Captioning: Input as an image (encoded with a CNN) and output as a descriptive sentence.
  4. Recognition of the Importance of Computational Resources and Data:
    The widespread adoption of GPU-based parallel computing and the growing recognition of large-scale datasets were revolutionary for deep learning and the broader field of data science. These developments enabled the explosive growth of deep learning during Step 1, dramatically expanding the scope of AI and data science applications. Tasks previously thought impossible became feasible, leading to groundbreaking impacts in industries and scientific research. Consequently, data scientists were compelled to continuously upgrade their skills and acquire broader knowledge to adapt to these new technologies and challenges.

    • Generalization of GPU-Based Parallel Computing:

      1. Dramatic Improvement in Computational Speed:

        • Training deep learning models on GPUs was commonly tens to more than a hundred times faster than on CPUs.
        • Example: AlexNet’s training time reduced from weeks to days.
      2. Realization of More Complex Models:

        • Significantly increased the feasible depth of neural networks.
        • Example: Training deep networks like ResNet with 152 layers became practical.
      3. Enabling Real-Time Processing:

        • Real-time applications in video analysis and natural language processing became possible.
        • Example: Real-time object detection and machine translation.
      4. Shortened Experimentation Cycles:

        • Model improvement and evaluation cycles became significantly shorter.
        • Accelerated the pace of research and development.
      5. Cost Reduction:

        • Enabled advanced computations without expensive supercomputers.
        • Made cutting-edge research accessible to individuals and small teams.
    • Increased Importance of Large-Scale Datasets:

      1. Significant Improvement in Model Performance:

        • Learning from diverse and massive datasets enhanced model generalization capabilities.
        • Example: Improved image recognition accuracy through the ImageNet dataset.
      2. Emergence of New Applications:

        • Novel AI applications leveraging large-scale data emerged.
        • Example: Advanced text generation by large language models like GPT.
      3. Development of Transfer Learning:

        • Pretrained models on large-scale datasets were adapted for various tasks.
        • Achieved high performance even on small-data tasks (a minimal transfer-learning sketch appears at the end of this section).
      4. Recognition of Data Quality Importance:

        • Reinforced the principle of "garbage in, garbage out."
        • Highlighted the need for data cleaning and quality management.
      5. Privacy and Security Challenges:

        • Ethical issues arose from handling massive personal data.
        • Fostered development of technologies like data anonymization and differential privacy.
  5. Proliferation of Open-Source Frameworks:

    • Emergence of TensorFlow, PyTorch, Keras, and others.
    • Lowered implementation barriers and accelerated research.
  6. Groundbreaking Achievements in Deep Learning

    1. Surpassing Human-Level Image Recognition:

      • In 2015, ResNet surpassed the estimated human error rate on ImageNet image classification.
      • AI began excelling in expert-level tasks, such as medical image diagnosis.
    2. Significant Improvements in Natural Language Processing:

      • Machine Translation: Although the Google Translate of that era was primitive by today's standards, it marked a major leap in quality when it arrived.
      • Question Answering: IBM Watson gained fame for its early successes, partly due to IBM's marketing strategies. Despite the hype, Watson faced challenges in meeting expectations. Although IBM no longer prominently markets Watson, its core technologies are integrated into various IBM products and services.
    3. Rapid Advances in Autonomous Driving Technology:

      • Enabled real-time environmental perception and decision-making.
      • Advances in computer vision brought the practical implementation of autonomous vehicles closer to reality.
    4. AI Applications in Creative Tasks:

      • AI began generating music and creating artwork.
      • Technologies like DeepDream and Style Transfer emerged.
    5. Acceleration of Scientific Research:

      • In 2020, DeepMind's AlphaFold2 achieved a decisive breakthrough on the protein structure prediction problem. Proteins, as fundamental components of life, have their functions determined by their 3D structures. While proteins are synthesized as amino acid chains (primary structure), predicting how they fold into specific 3D structures had remained a long-standing challenge in biology and biochemistry. This breakthrough significantly impacted the life sciences.
      • Accelerated drug discovery processes.
  7. Impact of Deep Learning on Industry

    1. Surge in AI Startups:

      • Emergence of new services leveraging deep learning.
      • Increased venture capital investment in the AI sector.
    2. Shift in AI Strategies of Major Tech Companies:

      • Establishment of AI research labs by companies like Google, Facebook, and Amazon.
      • Intensified competition for AI talent.
    3. Acceleration of AI Adoption in Traditional Industries:

      • Growing use cases of AI in manufacturing, finance, healthcare, and other sectors.
      • Applications included predictive maintenance, anomaly detection, and personalized medicine.
    4. Emergence of Ethical and Social Challenges:

      • Concerns over the fairness and transparency of AI decisions.
      • Heightened debates around privacy and data usage.
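
To make the transfer-learning point above concrete, here is a rough sketch assuming PyTorch and torchvision: a ResNet-18 pretrained on ImageNet is frozen, and only a small task-specific head is trained. The 10-class task, dummy batch, and hyperparameters are illustrative.

```python
# A rough transfer-learning sketch; assumes torch and torchvision.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # backbone pretrained on ImageNet
for param in model.parameters():
    param.requires_grad = False                    # freeze the learned features
model.fc = nn.Linear(model.fc.in_features, 10)     # new head for the small task

# Only the new head is optimized; the large-scale pretraining is reused as-is.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
x = torch.randn(4, 3, 224, 224)                    # dummy image batch
loss = nn.CrossEntropyLoss()(model(x), torch.randint(0, 10, (4,)))
loss.backward()
optimizer.step()
```

Freezing the whole backbone is the simplest variant; in practice, later layers are often unfrozen and fine-tuned at a lower learning rate.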

1.2 The Evolving Role of Data Scientists

Technological shifts, such as the rise of deep learning, have transformed problem-solving approaches in the field of data science. These transformations are primarily characterized by two major shifts: the transition from feature engineering to model architecture design and the adoption of end-to-end learning approaches.

Shift from Feature Engineering to Model Architecture Design

The advent of deep learning has significantly altered the focus of data scientists. Previously, manual feature design and extraction were central tasks. However, deep learning models can automatically learn features from raw data, leading to a shift in focus towards designing effective neural network architectures. Examples include:

  • The configuration of layers in Convolutional Neural Networks (CNNs).
  • The design of self-attention mechanisms in Transformer models.
  • The structural optimization of generators and discriminators in Generative Adversarial Networks (GANs).

This shift enables the development of more generalizable models that can be applied across various tasks using the same architecture. It also reduces time spent on data preprocessing, allowing greater focus on model design and experimentation. Despite these changes, domain knowledge remains crucial, with its application shifting from feature design to model architecture design.
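
To illustrate where the design effort now goes, here is a minimal sketch assuming PyTorch: the layer configuration itself is the design surface, and no hand-crafted features appear anywhere (layer sizes are illustrative).

```python
# A minimal sketch of architecture design as the main decision; assumes torch.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),        # features are learned from raw pixels end to end
)
print(model(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10])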

Adoption of End-to-End Learning Approaches

The adoption of end-to-end learning has introduced another significant transformation. Traditional pipeline approaches consisted of multiple independent steps (e.g., preprocessing, feature extraction, classification), each optimized separately. In contrast, end-to-end approaches train a single unified model that learns all steps simultaneously, from input to output.

This approach has been particularly impactful in fields such as speech recognition and autonomous driving:

  • In speech recognition, traditional multi-step processes have been replaced by models that directly convert audio into text using deep neural networks.
  • In autonomous driving, separate processes for object detection and path planning are integrated into neural networks that directly generate control commands from sensor inputs.

End-to-end learning simplifies systems by reducing intermediate steps, enhancing overall performance through joint optimization. However, this approach also increases reliance on large, high-quality datasets and introduces challenges related to model interpretability, as intermediate processes become less transparent.
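
A minimal sketch of this idea, assuming PyTorch: the stages that were once separately optimized pipeline components collapse into a single network trained under one loss (shapes and task are illustrative).

```python
# A minimal end-to-end sketch; assumes torch.
import torch
import torch.nn as nn

# What used to be separate, separately optimized stages (preprocessing,
# feature extraction, classification) becomes one network under one loss.
end_to_end = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 5),
)
x, y = torch.randn(8, 128), torch.randint(0, 5, (8,))
loss = nn.CrossEntropyLoss()(end_to_end(x), y)
loss.backward()   # gradients flow through every stage jointly
```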

Implications for the Role of Data Scientists

These shifts have evolved the role of data scientists, requiring a blend of system-level understanding and detailed knowledge of machine learning theory and implementation. Key changes include:

  • Greater emphasis on the collection and management of large and diverse datasets.
  • Increased focus on improving model interpretability and explainability.
  • Enhanced communication skills for collaborating with domain experts to define end-to-end objectives.

Furthermore, the diverse skill sets now required have led to functional differentiation within teams, such as distinct roles for data scientists, machine learning engineers, and other specialists.

1.3 Differentiation Between Data Scientists and Machine Learning Engineers

While data scientists and machine learning engineers are often described as distinct roles, there is significant overlap in their responsibilities. The extent of this overlap varies across organizations. This section provides one perspective on how these roles can be distinguished.

Key Focus and Skills

Data Scientist

  • Primary Focus: Data analysis, solving business problems, extracting insights.
  • Key Skills:
    • Statistics, data visualization, business acumen.
    • Expertise in machine learning theory and implementation.
  • Tools: Python, SQL, R, statistical software, business intelligence tools.

Machine Learning Engineer

  • Primary Focus: Development, implementation, and system-level optimization of advanced machine learning models.
  • Key Skills:
    • Deep learning, algorithm design, large-scale system implementation.
    • Model optimization and deployment.
  • Tools: TensorFlow, PyTorch, cloud platforms, GPU computation.

1.3.1 Skill Set Requirements for Data Scientists

  1. Theoretical Foundation

    • Deep understanding of deep learning concepts.
    • Familiarity with network architectures (e.g., CNNs, RNNs, Transformers).
    • Domain-specific knowledge (e.g., computer vision, natural language processing).
    • Mathematical foundations, including linear algebra, probability, and optimization.
    • Knowledge of optimization algorithms and learning theories.
  2. Implementation Skills

    • Proficiency in relevant programming languages (e.g., Python, C++).
    • Expertise in deep learning frameworks (e.g., TensorFlow, PyTorch).
    • Skills in efficient model implementation and debugging.
    • Writing clear, maintainable, and readable code.
  3. Architecture Design

    • Selecting and designing network architectures tailored to task characteristics.
    • Establishing techniques for optimal hyperparameter selection and tuning.
    • Designing custom layers and loss functions specific to problems.
  4. Learning Process Management

    • Applying effective data preprocessing and augmentation techniques.
    • Monitoring training processes and defining early stopping conditions.
    • Selecting appropriate evaluation metrics and interpreting results accurately.
  5. Real-World Application and Integration

    • Designing and implementing end-to-end learning systems.
    • Leveraging transfer learning and pre-trained models effectively.
    • Deploying and managing models considering real-world constraints.
  6. Model Interpretation and Explanation Skills

    • Understanding the behavior of black-box models.
    • Enhancing model interpretability.
    • Assessing the capabilities and limitations of AI models.
    • Communicating model functionality to stakeholders effectively.
  7. Ethics and Compliance

    • Understanding data privacy and ethical AI development.
    • Ensuring compliance with regulations and guidelines.

1.3.2 Skill Set Requirements for Machine Learning Engineers

Beyond standard engineering skills, machine learning engineers require the following specialized capabilities:

  1. Knowledge of Computing Environments and Hardware

    • GPU programming expertise.
    • Proficiency in leveraging cloud computing platforms.
    • Optimization skills for parallel computing environments (e.g., GPUs, TPUs).
  2. Data Management and Processing

    • Ability to construct and manage large-scale datasets.
    • Understanding and operational knowledge of distributed storage systems.
    • Mastery of big data processing frameworks (e.g., Hadoop, Spark).
  3. Model Optimization and Evaluation

    • Techniques for efficiently training large-scale models.
    • Understanding of model compression and quantization methods.
    • Skills in hyperparameter tuning.
    • Designing distributed learning systems and implementing parallel processing.
    • Efficient management of large-scale training using cloud resources and GPUs.

1.4 Data Visualization and Reporting

Data scientists must present complex analytical results in ways that decision-makers and non-technical stakeholders can easily understand, delivering profound insights based on data. Through advanced data visualization and reporting activities, they directly link analytical findings to organizational decision-making, enabling data-driven strategy formulation. This goes beyond merely displaying data—it involves interpreting advanced analytical results from statistical inference, machine learning, and causal inference to provide actionable insights.

The responsibilities in this role include:

a) Advanced Statistical Analysis Visualization

  • Visualization of multivariate analysis results (e.g., PCA, factor analysis).
  • Displaying performance metrics for machine learning models (e.g., ROC curves, confusion matrices).
  • Decomposing and visualizing trends and seasonality in complex time-series data.

b) Predictive Model Interpretation and Visualization

  • Illustrating the influence of key variables in decision trees or regression models.
  • Visualizing intermediate layers of deep learning models (e.g., feature maps).
  • Representing predictive confidence intervals and risk assessments visually.

c) Dimensionality Reduction and Visualization for Large Datasets

  • Employing t-SNE or UMAP for 2D/3D representation of high-dimensional data.
  • Interactive visualization of clustering results.
  • Graph representation of network analysis results (e.g., social network analysis).
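
As a brief illustration of this workflow, here is a minimal sketch assuming scikit-learn, projecting the 64-dimensional digits dataset to 2D with t-SNE (UMAP would come from the separate umap-learn package).

```python
# A minimal dimensionality-reduction sketch; assumes scikit-learn.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)               # 64-dimensional digit images
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_2d.shape)                                  # (1797, 2), ready to plot
```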

d) Causal Inference Visualization

  • Visualizing intervention effect estimation results.
  • Comparative visualization of counterfactual scenarios.
  • Diagramming results from structural equation modeling (SEM).

e) Uncertainty and Risk Visualization

  • Displaying probability distributions from Monte Carlo simulation results.
  • Visualizing posterior distributions and credible intervals in Bayesian estimation.
  • Using tornado diagrams to represent sensitivity analysis results.

f) Advanced Interactive Visualizations

  • Dynamic visualization of real-time data streams.
  • Dashboards with complex filtering and drill-down capabilities.
  • Immersive visualization of multidimensional data using VR/AR.

g) Explainable AI (XAI) Visualization

  • Local Interpretation with SHAP and LIME:
    These methods are used to mitigate the black-box nature of machine learning models, making decision processes transparent, particularly in domains like finance, healthcare, and law where justifying decisions is critical (a brief sketch appears at the end of this subsection):

    • Explaining reasons behind specific predictions.
    • Assessing model reliability and fairness.
    • Identifying areas for model improvement.
  • Global Interpretation with PDP and ALE Plots:
    Partial dependence plots and accumulated local effects plots help in understanding the global behavior of models by showcasing:

    • Non-linear relationships.
    • Feature importance evaluation.
    • Interaction between features.
    • Alignment with domain knowledge.
  • Adversarial Sample Analysis and Visualization:
    The study and visualization of adversarial samples highlight vulnerabilities in machine learning models and enhance their robustness:

    • Identifying model weaknesses.
    • Evaluating security risks.
    • Improving model reliability and security.
    • Increasing interpretability.
    • Continuous monitoring and reporting to stakeholders.
    • Providing actionable insights for ongoing enhancements.
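
As a concrete illustration of the SHAP-based local interpretation mentioned above, here is a rough sketch assuming the shap and scikit-learn packages; the data and model are illustrative.

```python
# A rough SHAP sketch for a tree model; assumes shap and scikit-learn.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:50])   # per-feature, per-sample contributions
shap.summary_plot(shap_values, X[:50])        # aggregate view of local explanations
```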

h) Advanced Narrative Building and Insight Communication

  • Transforming complex analytical results into a story for non-technical audiences.
  • Reporting with a balance between statistical significance and practical relevance.
  • Providing layered insights tailored to different stakeholders.

i) Reporting of Predictive Analytics Results

  • Visualizing probabilistic predictions for future scenarios.
  • Developing "What-if" analysis tools and dynamically displaying results.
  • Tracking and visualizing the performance evolution of predictive models over time.

j) Anomaly Detection and Root Cause Analysis Visualization

  • Visualizing multivariate anomaly detection results using heatmaps.
  • Considering the context of time-series anomalies in visual representations.
  • Interactive exploration tools for causal graphs in root cause analysis.

2. Step 2: The Rise of Generative AI (Democratization of AI)

The fields of Artificial Intelligence (AI) and Machine Learning (ML) have undergone dramatic evolution in recent years. Starting with the rise of deep learning in Step 1, the transition into the era of Generative AI in Step 2 has significantly expanded the technology and its applications. One of the most prominent changes in this transition has been the scaling up of models: models that once had millions, and later billions, of parameters have now grown to hundreds of billions. The advent of large language models like GPT-3, BERT, and T5 has greatly expanded the potential of Natural Language Processing (NLP), enabling human-like text generation and complex task execution.


At the same time, the generalization of transfer learning and fine-tuning techniques has led to the widespread adoption of pre-trained models. This has made it possible to efficiently adapt models to specific tasks, significantly improving the speed and accuracy of AI development. Furthermore, advances in multimodal learning have led to models that can integrate different types of data such as text, images, and speech. Innovative architectures like DALL-E and CLIP demonstrate the ability to generate images from text or understand the content of images through language, significantly broadening the scope of AI applications.


There has also been remarkable progress in generative models, with the shift from GANs to Diffusion Models. This transition has enabled the generation of higher-quality and more diverse images, audio, and text, accelerating the use of AI in creative fields. Additionally, the democratization of AI and the rise of low-code platforms have made it possible for people with no specialized knowledge to utilize AI functionalities. Platforms like Hugging Face and OpenAI API have made it easier to integrate AI into general applications, broadening its influence across business and daily life.


In this context, the role of data scientists is also evolving. However, the degree of this evolution varies significantly across organizations and industries. In advanced companies, data scientists are increasingly involved in the design of AI systems and strategy planning, while in many organizations, traditional data analysis and building basic machine learning models remain the core tasks. With the rapid advancement of technology, the skill set required for data scientists has expanded, but many practitioners and organizations are struggling to keep up with these changes.


Therefore, while the influence of AI and machine learning is undoubtedly growing, challenges remain in their actual application and value creation. In the era of Generative AI (Step 2), the success or failure of companies will largely depend on how they leverage and integrate this technology into their organizations. The way data scientists' roles are defined will be one of the key factors influencing whether they succeed or fail.

2.1 The Beginning of the "Generative AI Era"

It is difficult to pinpoint the exact start of the "Generative AI Era," but this document takes around 2022 as a key reference point. Below are the main historical contexts related to this shift.

  • 2014: Introduction of GANs (Generative Adversarial Networks): Proposed by Ian Goodfellow and others, opening up new possibilities for generative models.
  • 2017: Introduction of Transformer architecture: Proposed in the paper "Attention Is All You Need," forming the foundation for large-scale language models.
  • 2018-2019: Introduction of large-scale language models like GPT-2 and BERT: Achieving high performance in natural language processing tasks, drawing significant attention.
  • 2020: Introduction of GPT-3: A massive model with 175 billion parameters, demonstrating the ability to learn various tasks from a small number of examples. This marks the broad recognition of the potential of generative AI.
  • 2021: Introduction of DALL-E and other image-generating AIs (Midjourney followed in 2022): Demonstrating the ability to generate high-quality images from text, expanding the possibilities for creative AI applications.
  • 2022: Public release of ChatGPT: OpenAI's release of a conversational AI made the potential of generative AI widely known to the general public. This year is often referred to as the "Year of Generative AI."
  • 2023: Introduction of GPT-4 and the proliferation of various generative AI tools: Models with advanced understanding and generation capabilities emerge, and generative AI tools become widely available across various domains like image generation, voice generation, and code generation.

Looking at this timeline, one might interpret the 2020 release of GPT-3 as the symbolic beginning of the "Generative AI Era" from a technological standpoint. Alternatively, the 2022 public release of ChatGPT marked the moment when cutting-edge AI began to make a profound impact on daily life and business, which could also be seen as the beginning of the era. Both interpretations are valid, but this document takes the "democratization of AI," the point at which people could easily access and actively use AI, as the more epoch-making event.


Now, let's take a look at the abstract of the paper "Language Models are Few-Shot Learners," in which GPT-3 was introduced in 2020 (paraphrased below).


Recent research has shown that by pre-training on large text corpora and then fine-tuning on specific tasks, significant results have been achieved across many NLP tasks and benchmarks. This architecture is generally task-independent, but still requires task-specific fine-tuning datasets containing thousands to tens of thousands of examples. In contrast, humans can typically perform new language tasks with only a few examples or simple instructions—something current NLP systems still struggle with significantly.

Here, we show that scaling up language models significantly improves task-independent few-shot performance and, in some cases, achieves levels that can rival previously state-of-the-art fine-tuning approaches. Specifically, we trained GPT-3, an autoregressive language model with 175 billion parameters. This is more than ten times the parameters of previous non-sparse language models. We then tested its performance in a few-shot setting. For all tasks, GPT-3 was applied without any gradient updates or fine-tuning, and the task and few-shot instructions were specified purely through text interactions with the model.

As a result, GPT-3 achieved high performance on many NLP datasets, such as translation, question answering, and cloze tasks. It also demonstrated high performance on tasks requiring on-the-fly reasoning and domain adaptation, such as word unscrambling, using new words in sentences, and performing three-digit arithmetic. At the same time, we identified datasets where GPT-3's few-shot learning still struggled, as well as methodological challenges related to training on large web corpora. Finally, we found that GPT-3 could generate news article samples that were difficult to distinguish from human-written articles. The paper closes with this finding and a broader discussion of GPT-3's societal implications.


This passage describes the groundbreaking achievement that we now take for granted in systems like ChatGPT: the ability to "learn" from minimal information (few-shot) and respond appropriately to prompts. By simply providing examples or specific instructions in the form of a prompt, the model is able to generate the desired output.
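
As a minimal illustration of this few-shot behavior, consider the following sketch: the task is specified entirely through examples in the prompt, with no gradient updates or fine-tuning. The translation examples follow the pattern used in the GPT-3 paper, and `call_llm` is a hypothetical stand-in for whichever model API is used.

```python
# A minimal few-shot prompt; `call_llm` is a hypothetical helper, not a real API.
prompt = """Translate English to French.
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
# completion = call_llm(prompt)   # a capable model continues with "fromage"
```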


In addition to the release of ChatGPT, 2022 brought another milestone: the paper Emergent Abilities of Large Language Models from Google Research. It was already known that increasing the size of a language model improves its performance, but this work highlighted something more surprising: beyond a certain model size, unexpected abilities appear abruptly. It was the first study to systematically examine this phenomenon of "emergent abilities" in large language models. Although the paper did not reach a full understanding of why emergent abilities occur, it organized perspectives on how to study them. In the near future, we may enter an era in which the mechanisms behind emergent abilities are uncovered; if that happens, many human abilities might be reproduced with great efficiency. One can even draw an analogy to biology: life is encoded in DNA base sequences (in effect, very long strings), and over roughly three billion years it has evolved new abilities from simple bacterial origins to the present, yet we still understand almost nothing about the mechanisms behind such emergence. In the author's view, understanding and applying these emergent mechanisms may become a key theme for the next stage of the "Generative AI Era."

2.2 Application of Generative AI in Business

2.2.1 Automation of Internal Operations

Generative AI is beginning to streamline various business processes traditionally performed by humans, supplementing or replacing human labor. Although many of these tasks are not directly related to data science, they include the following:

  1. Document Creation and Management

    • Report Generation: Automatically create management reports, progress reports, meeting minutes, etc.
      • Example: Generative AI creates a draft of a management report, and the responsible person reviews and edits it.
    • Document Summarization: Summarize long reports and contracts, extracting key points.
      • Example: Automatically converting a 100-page technical report into a 2-page summary.
  2. Customer Support

    • Internal Help Desk: Automatically respond to IT-related or HR-related queries from employees.
      • Example: Providing detailed steps for "How to apply for paid leave?"
    • Automatic FAQ Generation: Automatically create and update frequently asked questions and their answers.
      • Example: Automatically generating an FAQ for a new product based on information from similar past products.
  3. Programming and Development

    • Code Generation: Automatically generate basic code based on specifications.
      • Example: Automatically generating the basic structure for database connections or API endpoints.
    • Bug Fixing Support: Analyze error messages and suggest possible fixes.
      • Example: Providing error causes and solutions similar to StackOverflow.
  4. Data Analysis and Reporting

    • Data Cleansing: Automatically format raw data into a form suitable for analysis.
      • Example: Converting customer data in different formats into a unified format (a brief sketch appears after this list).
    • Insight Generation: Automatically extract important trends and anomalies from datasets.
      • Example: Automatically analyzing monthly sales data to identify growth rates by region or seasonal fluctuations.
  5. Scheduling and Coordination

    • Meeting Scheduling: Suggest optimal meeting times considering the schedules of multiple participants.
      • Example: Automatically proposing a time when all 10 executives are available based on their schedules.
    • Project Management: Generate the optimal schedule while considering task dependencies.
      • Example: Creating the optimal Gantt chart for a large project with more than 100 tasks.
  6. Marketing Content Creation

    • Social Media Posts: Automatically generate posts that align with the brand's tone of voice.
      • Example: Automatically generating Twitter and Instagram posts from a new product press release.
    • Email Campaigns: Generate personalized email content tailored to customer segments.
      • Example: Creating promotional emails tailored to each customer based on their purchase history.
  7. Legal and Contract Management

    • Contract Review: Analyze contracts to highlight potential risks and issues.
      • Example: Automatically detecting unfavorable clauses or vague terms in a 100-page licensing agreement.
    • Answering Legal Queries: Respond to internal legal questions, referencing relevant laws and past precedents.
      • Example: Answering the question, "Does this new service violate personal data protection laws?" with specific legal references.
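
As a concrete illustration of the data-cleansing example in item 4 above, here is a minimal sketch assuming pandas 2.x; the records are invented.

```python
# A minimal data-cleansing sketch; assumes pandas 2.x (for format="mixed").
import pandas as pd

raw = pd.DataFrame({
    "signup": ["2024/01/05", "2024-02-05", "05 Mar 2024"],  # mixed date formats
    "spend": ["1,200", "980", "2,450"],                     # numbers as strings
})
clean = raw.assign(
    signup=pd.to_datetime(raw["signup"], format="mixed"),   # unified datetimes
    spend=raw["spend"].str.replace(",", "").astype(int),    # unified integers
)
print(clean.dtypes)
```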

As demonstrated above, generative AI has the potential to be applied across a wide range of business areas. However, it is important to note that generative AI does not fully automate these tasks but rather assists human work, enhancing efficiency. In many cases, the output generated by AI still requires human review and editing, highlighting the importance of collaboration between humans and AI.


Furthermore, to successfully implement these automations, careful preparation and continuous improvement are necessary, including proper data preparation, AI model fine-tuning, and the redesign of business processes. Privacy, security, and ethical considerations also pose significant challenges. The introduction of generative AI should be seen not merely as a means of improving operational efficiency but as an opportunity for employees to focus on more creative and strategic tasks.

When adopting the recent trend of "replacing what can be replaced by AI in existing operations," companies need to keep two key points in mind: the rapid obsolescence of technology and the higher-level positioning of data scientists.

  1. Rapid Obsolescence and Evolution of Technology:

    • If the push to AI-enable internal operations is done hastily without aligning properly with the internal IT roadmap, an overflow of End User Computing (EUC) tools may occur, making it difficult for the IT department and the company to keep track of the overall picture.
    • At the same time, the evolution of generative AI is moving at an exceptional pace, making it highly risky to rely on a single solution. Legacy models may become unsupported, causing systems to malfunction, and the cost and frequency of replacing them with newer models will likely be higher than before.
    • Additionally, there is a similar risk when investing heavily in developing proprietary AI tools. After paying significant development fees to vendors to create a custom model, a new generative AI model may emerge in a matter of months that surpasses the performance of the company's proprietary model. In such a case, recovering the investment may become impossible, and the risk is more significant than it might initially appear. Alternatively, the company might be forced to pay substantial development costs again to rebuild the model with the vendor.
  2. Elevating the Role of Data Scientists:

    • To mitigate the issues and risks outlined above, it is crucial to develop data scientists not merely as analysts, but as experts who can also assess future technologies.
    • Specifically, decisions on which technologies and processes to keep in-house, which parts to use general generative AI for, and where to take the next step in fine-tuning a custom AI require individuals who are well-versed in technology and trends, as it is easy to misjudge these decisions without that expertise.
    • Particularly in departments like sales, where decisions are often pushed forward based on loud demands, IT and engineering teams may struggle to raise important concerns about future risks to business efficiency when implementing such requests.
    • Therefore, it is important to position data scientists in a more elevated role, establishing a reasonable balance of power between them, the sales/front-end departments, and the IT/engineering teams. This will be a key management challenge. Additionally, granting data scientists the responsibility and authority to ensure alignment with business strategies and consider societal and ethical implications will be essential. For this, it will also be crucial to gain support from the executive leadership and transform the organizational culture.

2.2.2 Prompt Engineering is Not Data Science, but Data Scientists Should Understand How to Use Generative AI

Automating tasks with LLM (large language model) prompt engineering, in which instructions are crafted so that the AI can more easily produce the desired answer, can be remarkably easy to execute. Tasks that would be impossible, or would require massive investment, if written as traditional rule-based logic can be carried out with extremely short instructions. This is startling, and it is a large part of why generative AI has become explosively popular.

However, while prompt engineering is often encouraged as a key skill for effectively using generative AI, is focusing too much on prompt engineering the right approach?

Certainly, if you only need a good result once, then prompt engineering can be fine. But generally, when automating business processes, this is not the best approach. This is because what works with a specific version of a particular LLM model might not work with other versions. In other words, humans may end up overfitting to a single version of a model. As mentioned earlier, with the rapid obsolescence of technology, when a new model comes out, you could end up reworking prompts all over again—a tiresome and unnecessary task.

A more realistic approach is to focus on the sub-tasks within the overall task where LLMs excel, and use as simple prompts as possible for those sub-tasks. For tasks where LLMs are not effective, it is better to implement them using other machine learning techniques. In these cases, data scientists are responsible for identifying which tasks are better suited for different machine learning techniques, and then implementing those tasks as a Proof of Concept (PoC).

For example, imagine you want to create a list of product and service names and their descriptions from public documents of all listed companies, and you want the list to be in both Japanese and English. If a human were to do this, they would find the product pages on the company's website, copy and paste the product name and description into an Excel sheet, and repeat the same task on the corresponding English pages. While it sounds simple when written down, the task would require a tremendous amount of manpower, and the costs associated with it would be substantial.

In order to automate such tasks without human intervention, it is necessary to combine various technologies. For example, we can break down the work process into the following six stages:

  1. Find and list product and service introduction pages from publicly available company websites.
  2. Visit these web pages and extract text (only from pages that can be crawled).
  3. Adjust the text so that AI can perform optimally (e.g., break the text into chunks and categorize them into those that are likely to contain product names and descriptions and those that are not).
  4. Send prompts to the AI to extract product names and descriptions, and generate a table based on the results.
  5. Evaluate the extracted results for correctness; if the results are incorrect, use another method to achieve the expected outcome.
  6. Integrate the results in Japanese and English, and check the consistency and completeness of the Japanese-English correspondence.

It is possible to leverage generative AI in all of the above steps, though there are many aspects that can be done without it. However, without knowledge of not only the latest technological trends but also conventional machine learning techniques and engineering skills, it would be difficult to identify possible alternatives, challenges, risks, performance comparison, and cost considerations.
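
Here is a rough sketch of how stages 2 through 4 might be combined, assuming the requests and BeautifulSoup libraries. `call_llm` is a hypothetical helper standing in for whichever generative AI API is used, and the prompt wording is illustrative.

```python
# A rough sketch of stages 2-4; assumes requests and beautifulsoup4.
import requests
from bs4 import BeautifulSoup

def call_llm(prompt: str) -> str:
    """Hypothetical: send `prompt` to a generative AI service, return its reply."""
    raise NotImplementedError

def extract_products(url: str, chunk_size: int = 2000) -> list[str]:
    html = requests.get(url, timeout=10).text                        # stage 2: fetch
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    # Stage 3: split into chunks the model can handle reliably.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    results = []
    for chunk in chunks:
        # Stage 4: keep the prompt simple to avoid overfitting to one model version.
        prompt = ("Extract any product or service names with one-line "
                  "descriptions from the text below, one per line as "
                  "'name<TAB>description'. If none, answer 'NONE'.\n\n" + chunk)
        results.append(call_llm(prompt))
    return results
```

In practice, stage 5's evaluation would decide which chunks to retry with a different method, in line with the division of labor described above.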

If engineers without expertise in data science attempt to handle this process using only prompt engineering and typical data processing, they are likely to over-rely on prompt engineering and fail to produce good results. This scenario highlights that, in the era of generative AI, the importance of data scientists will only continue to grow.

2.2.3 The Impact of Generative AI on Application Development and Data Science

Generative AI is ushering in a significant turning point in the technology industry. For example, the emergence of services that allow the development and deployment of web applications from scratch using just prompts symbolizes this transformation. On the surface, it may appear that anyone can now easily create applications, advancing the democratization of development. However, the true essence of this change lies deeper.

This technological innovation is creating two major trends. In the short term, we will indeed see increased development efficiency and democratization, leading to the creation of many applications worldwide. However, a more important long-term change is the emergence of what could be called "Sage AI," or artificial superintelligence (ASI), which is expected to bring structural transformation. This ASI could potentially provide value directly via APIs without even requiring the application form.

Nevertheless, such a change does not negate human creative value. On the contrary, as simple development tasks are automated, the creation of truly valuable solutions, which only humans can accomplish, will become even more important. The evolution of technology will strengthen our recognition of the importance of creative thinking.

In the coming era, engineers and data scientists will be most needed for their ability to ask the right questions of AI. This is not just about the skill of writing prompts. It is about deeply understanding the essence of technology and being able to engage in creative dialogue based on that understanding. Moreover, the ability to translate that technical understanding into actual business value will become essential.

In the field of data science, the importance of this fundamental understanding is particularly apparent. Blindly accepting AI-generated code or analysis results can lead to extremely dangerous outcomes. Without a deep understanding of statistical theory, the foundational principles of machine learning, and the assumptions and limitations of those methods, one cannot make appropriate judgments. Theoretical understanding serves as the foundation for practical value creation.

The technology professionals of the future will need to integrate three key capabilities. First, the ability to deeply understand the essence of technology. Second, the ability to accurately grasp the business and societal context. Third, the ability to create new value based on these understandings. With these three capabilities, one can generate truly valuable solutions in partnership with AI.

Technological innovation will continue to accelerate, and AI capabilities will improve further. However, rather than diminishing human creative value, this will enhance it. By gaining access to more powerful tools, humans will be able to focus more on essential creative thinking. A deep understanding of foundational theory, the ability to ask creative questions, and the capacity to conceive new value will empower individuals to truly lead the technology field in the age of AI.

2.3 Future Demand Trends for Data Scientists

Let's now summarize the future demand trends for data scientists. The factors contributing to both the increase and decrease in demand are as follows:

  • Factors Reducing Demand:

    1. Advancement of Automation Tools: With generative AI making data preprocessing and basic analysis tasks easier, individuals without a background in data science will be able to perform similar tasks.
    2. Proliferation of No-Code/Low-Code Platforms: As the trend of no-code and low-code platforms continues to grow, non-engineers can implement business applications and perform data analysis, reducing the reliance on data scientists for basic analysis tasks.
  • Factors Increasing Demand:

    1. Increasing Importance of Data-Driven Decision Making: More companies are adopting data analysis as a core part of their operations. Generative AI has significantly lowered the barriers to data analysis, leading businesses that were previously reluctant to engage in data-driven strategies to do so actively.
    2. Emergence of New Data Sources: With the rise of IoT, social media, and other platforms, an unprecedented amount of data is being collected and made available for analysis.
    3. New Business Opportunities from Technological Advances: For example, quantum computing, which is still at least 10 years away from widespread use, could create new roles for data scientists once it becomes practical. Even without such a significant development, smaller technological advancements continue to expand the possibilities for data scientists, leading to an increase in demand for their expertise.

From the above, it is clear that, on balance, the demand for data scientists is likely to increase over the long term, with no significant decrease in sight. As the value of information continues to rise, companies will increasingly need to harness it effectively. In this era, it will be difficult for ordinary companies to become leaders in AI modeling by developing generative AI from scratch or by manufacturing the GPUs required for computation. However, this does not mean they are powerless. Investing in the human brains that can leverage these tools effectively is a very practical and important choice.

2.4 Structural Changes and the Future of Data Scientists Due to the Development of Generative AI

The rapid development of generative AI is bringing revolutionary changes to the field of data science. This implies that companies must adapt to the new era in order to survive, and as a result, the role and skill set required for data scientists must be fundamentally redefined.

2.4.1 Structural Changes Brought by Generative AI

The structural changes in data science brought by the rapid development of generative AI can be summarized as follows:

  1. Acceleration of Automation:

    • Automation of basic data processing, cleaning, and feature engineering
    • Efficiency improvements in the development process through code generation
    • Automation of report generation and basic analysis tasks
  2. Democratization of Knowledge:

    • The proliferation of tools that allow non-experts to perform advanced analysis
    • Simplification of data querying via natural language interfaces
    • The low-cost and fast access to specialized knowledge embedded in AI
      • Advances in transfer learning: Efficiency in acquiring domain-specific knowledge
  3. Paradigm Shift in Data Generation:

    • Large-scale generation of synthetic data to expand training datasets
    • Improved ability to simulate rare events and anomaly cases (a brief sketch appears after this list)
    • New methods for data sharing and utilization under privacy protection
  4. Advancement of Analysis and Expansion of Business Options:

    • Generalization of multimodal data integration for analysis
    • Increased demand for real-time analysis and immediate decision support
    • Improved feasibility of tackling more complex problems and applying advanced analysis techniques
    • Expanding business options by incorporating data science into higher-level business decision-making processes
  5. Increased Importance of Ethics and Governance:

    • The need to verify the authenticity and reliability of AI-generated content
    • Emerging challenges related to data privacy and fairness
    • Growing demands for explainability and transparency in AI systems
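
As one concrete illustration of simulating rare events with synthetic data, here is a minimal sketch assuming the imbalanced-learn and scikit-learn packages; the 2%-minority dataset is invented.

```python
# A minimal synthetic-oversampling sketch; assumes imbalanced-learn
# and scikit-learn.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # rare class synthetically balanced
```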

2.4.2 Corporate Response Capabilities in the Face of Rapid AI Developments

The rapid structural changes brought by generative AI present two major challenges for companies. By appropriately addressing these challenges, companies can establish a competitive advantage and mitigate various risks brought by AI.

  • First Challenge - Redefining the Role of Data Scientists: Traditionally, data scientists have been seen merely as problem-solvers. With the evolution of AI, however, their role needs to expand beyond that position into a more strategic one. For example, by incorporating the insights of top-tier data scientists into executive decision-making, companies can uncover promising options that the business planning team might otherwise overlook. This also helps mitigate the following risks when steering through rapid technological innovation:

    • Risk of Rapid Obsolescence: Influenced by the rapid evolution of AI, a company might decide to adopt generative AI to benefit its business and invest significant capital in hiring external experts to build AI services. The resulting service could then be rendered obsolete overnight when newer generative AI models offer similar functionality almost for free, leaving the investment wasted.
    • Risk of Excessive Caution: Fear of obsolescence might lead companies to do nothing at all, resulting in their low-productivity services or high-cost products losing competitiveness in the market. Alternatively, a company may implement AI in areas without much impact, just to give the impression of doing something, while competitors successfully develop disruptive technologies and pull far ahead.
  • Second Challenge - Correctly Assessing and Properly Developing the Skills of Data Scientists: Alongside redefining the role of data scientists, companies must acquire and nurture talent according to this new definition. By forming teams of data scientists who can rapidly enhance their skills, leveraging rapidly evolving AI, companies can mitigate the following risks:

    • Over-reliance on Generative AI: The rapid evolution of generative AI might lead employees or data scientists to stop researching or thinking for themselves, relying too heavily on AI. Particularly in companies that have not redefined the role of data scientists, there may be a push to simply "become proficient in generative AI and raise AI literacy." What could happen as a result? An increasing number of employees may begin to think, "Let AI handle the work; I can just get by." Others may give up entirely, believing that "there's no way a human can compete with AI that improves daily." Indeed, as AI continues to improve, both specialized tasks and commoditized work will increasingly be replaced by AI.

    However, companies should not let their valuable human resources merely master AI and become tools of AI; rather, employees should use this new tool to maximize the efficiency of their work. The goal should be to help employees efficiently absorb specialized knowledge, including knowledge provided by AI, and grow rapidly. Companies that establish such systems will also attract top talent.

By addressing these two challenges, companies can make accurate investment decisions regarding AI and reduce the risks of over-reliance on generative AI among their staff. For instance, by redefining the role of data scientists, companies can avoid over-investing in expensive AI systems and instead adopt strategies to effectively leverage existing AI tools. At the same time, properly developed data scientists will use AI not just as a tool, but as a means to creatively solve problems.

As a result, companies can adapt to the rapid evolution of AI, avoid unnecessary investments, attract top talent, and maximize the potential of that talent. In this way, they can establish a sustainable competitive advantage in an AI-driven business environment.

2.4.3 Changes in the Role of Data Scientists

Let's now review the changes in the role of data scientists. While the traditional role as a specialized analyst remains, additional responsibilities are expected to be added to the role of senior data scientists. The key feature is that cross-functional skills will be required more than ever before.

  1. Strategic Problem Setter:

    • Leading the development and execution of the organization’s data utilization strategy
    • Translating business challenges into “AI-solvable” problems
  2. AI-Human Collaboration Orchestrator:

    • Building collaborative models that maximize the strengths of both AI and human experts
    • Promoting AI literacy within the organization
  3. AI System Designer:

    • Designing solutions that integrate multiple AI models with traditional analysis methods
    • Architecting and optimizing AI systems
  4. Advanced Problem Solver:

    • Developing innovative analytical approaches
    • Proposing creative solutions to complex interdisciplinary problems
  5. Ethical AI Advocate:

    • Ensuring fairness, transparency, and explainability in AI
    • Developing and implementing ethical guidelines for AI usage

Such skills cannot be acquired overnight. Even if companies hire data scientists who are considered top-notch externally, they will likely struggle to perform well in a short period due to a lack of domain knowledge. Similarly, hiring external consultants in a hurry may result in generic suggestions because of insufficient data science expertise. Therefore, companies need to approach the development of their data scientists from a long-term and comprehensive perspective.

In conclusion, the development of generative AI will elevate data scientists from "specialized analysts" to roles that also encompass "strategic partner." Data scientists who can adapt to this change, collaborate with AI, and deliver human-centered value will play a central role in the future data-driven society.

3. Step 3: Long-Term Outlook – The Explosive Evolution of AI

Finally, let's briefly look at the longer-term outlook. By around 2034, I anticipate that the role of data scientists will expand further, evolving into that of designers and managers of AI systems across society. The technological backdrop for such a development could include the following features. Of course, there is no solid evidence for this; it is a discussion of possibilities.

3.1 Key Features of Technological Development

Long-term technological development can be categorized along several axes: (1) the evolution of AI theory and technology itself; (2) the expansion of possibilities through improved computational power, such as quantum computers; (3) changes in the interface between humans and AI, producing qualitative changes from the user's perspective; and (4) the development of AI applications across various fields.

  1. The Evolution of AI Theory and Technology: The Emergence of Self-Evolving AI and Self-Creating AI Leading to an Explosive Acceleration of Technological Progress

    • Self-Evolving AI: AI that can autonomously improve its algorithms and structure based on existing learning data and experience, leading to performance enhancement.
    • Self-Creating AI: AI capable of generating completely new concepts and solutions beyond existing knowledge and experience.
    • Mechanisms for Explosive Acceleration of Technological Progress:
      • a) Continuous Self-Improvement: Self-evolving AI will constantly analyze its performance, identify areas for improvement, and autonomously make updates. This process will continue 24/7, resulting in exponential performance improvements without human intervention.
      • b) Creative Problem Solving: Self-creating AI will produce new approaches and solutions that are unconstrained by existing frameworks. This will lead to frequent innovative breakthroughs that were previously unattainable through human thinking.
      • c) Cross-Disciplinary Knowledge Integration: These AIs will integrate knowledge from various fields independently, leading to new discoveries and technologies. For example, combining knowledge from biology and materials science could allow the design of entirely new materials with unique properties.
      • d) Ultra-Fast Experimentation and Validation: AI will be able to conduct and analyze countless experiments and simulations in virtual environments in an instant, determining optimal solutions. This will drastically shorten research and development processes that previously took years.
      • e) AI Developing AI: More advanced AI will begin designing and developing the next generation of AI. As this cycle repeats, AI capabilities will rapidly improve, further accelerating the speed of technological evolution.
  2. The Evolution of Computational Power:

    • Partial Realization of Ultra-Fast Computation by Quantum Computers: Quantum computers may complete, within practical timeframes, large-scale computations that would take years with today's GPU parallelization technology and are therefore not yet realistic research subjects. While full practical implementation of quantum AI is unlikely within 10 years, ultra-fast quantum computation might become operational for specific fields or limited problems such as cryptography and combinatorial optimization.
    • GPU Speed Improvements: GPUs themselves are expected to experience significant performance improvements. With advancements in nanometer-scale manufacturing technology, 3D stacking, the adoption of new materials, and the development of innovative architectures or dedicated AI computation units, GPU performance and efficiency will likely increase dramatically.
    • Evolution of Programming Languages: Currently, machine learning is typically coded in Python. However, preprocessing with pandas can be slow, which is one reason for decreased productivity among data scientists, and practical implementations often require rewriting code in faster languages, creating extra work. In the future, compiled languages with Python-like syntax may emerge, or a new language might take Python's place, improving computational and business efficiency in practice (see the first sketch after this list).
  3. Development of Bio-AI Interfaces: Human Augmentation AI

    • What is a Bio-AI Interface?: There is potential for technologies that directly connect the human brain and nervous system to AI systems. This would allow humans to communicate thoughts or intentions directly to AI or, conversely, have information transmitted from AI directly to the brain, even if only partially.
    • Human Augmentation AI: Human augmentation AI refers to technologies that use AI to enhance human cognitive abilities, memory, and information processing. Through bio-AI interfaces, a close collaboration between the human brain and AI could become possible. For individuals who have lost certain bodily functions due to illness or accident, AI (and robotics) could potentially restore those functions.
  4. Application Development Axes:

    a. Predictive Social Simulators

    • What is a Predictive Social Simulator?: A "Predictive Social Simulator" is a forecasting engine that uses large-scale data and AI to comprehensively simulate various aspects of society (such as the economy, demographics, and social behavior) and predict future developments, including "what-if" analyses of specific scenarios (see the second sketch after this list).
    • Potential Application Areas:
      • Policy Making: Assessing the long-term impacts of various policies
      • Crisis Management: Considering responses to long-term crises like pandemics or climate change
      • Urban Planning: Long-term urban development plans based on demographic and economic changes
      • Social Trend Prediction: Predicting societal trends such as consumer behavior and changes in values

    b. Societal Large-Scale Optimization Systems

    • What is a Societal Large-Scale Optimization System?: This system comprehensively analyzes and optimizes large-scale societal systems, such as cities or nations, using AI and big data. The approach could also include optimizing predictions from the above simulators. Rather than focusing on individual elements, the system may optimize for overall efficiency, sustainability, and fairness. In such cases, the design of loss functions (or utility functions) would be critical, so the engineers involved would need both technical expertise and an awareness of societal ethical norms.
    • Significance: It has the potential to bring more scientific and objective approaches to areas currently influenced by political debates and subjective judgments.
      • Data-Driven Decision Making:
        • Current: Many policy decisions are based on limited data, personal experience, or political ideology.
        • Future: Large-scale data analysis and AI modeling will allow for more accurate predictions of the positive and negative impacts of policies, facilitating evidence-based decisions.
      • Scientific Analysis of Complex Social Issues:
        • Current: Complex issues such as economics, environment, and public health are often overly simplified in discussions.
        • Future: By modeling real-world causal relationships, these issues can be comprehensively analyzed in terms of their interactions and long-term impacts.
      • Real-Time Feedback and Adjustment:
        • Current: It takes time to assess the impact of policies, and corrections are often delayed.
        • Future: Real-time data collection and analysis will allow for immediate evaluation of policy impacts, enabling rapid adjustments as needed.
      • Multi-Objective Optimization:
        • Current: Evaluating trade-offs between different policy goals is difficult.
        • Future: Advanced optimization algorithms will allow multiple objectives, such as economic growth, environmental protection, and social justice, to be considered simultaneously, leading to optimal solutions (see the third sketch after this list).
      • Bias Reduction:
        • Current: Personal biases or conflicts of interest often influence policy decisions.
        • Future: Data-driven approaches could lead to more objective and fair decisions.
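
As a first sketch of the tooling shift mentioned under "Evolution of Programming Languages," here is the same group-by computed in pandas and in Polars, a Rust-backed DataFrame library that already offers a compiled, parallel engine behind Python-like syntax. The dataset and column names are invented, and relative speed will vary by machine and library version:

```python
# A hedged comparison sketch: identical aggregation in pandas and Polars.
import numpy as np
import pandas as pd
import polars as pl

n = 5_000_000
rng = np.random.default_rng(0)
data = {"key": rng.integers(0, 1_000, n), "value": rng.random(n)}

pdf = pd.DataFrame(data)
pldf = pl.DataFrame(data)

# Same group-by mean in both libraries; Polars parallelizes across cores
# and avoids much of pandas' per-operation overhead.
pandas_result = pdf.groupby("key")["value"].mean()
polars_result = pldf.group_by("key").agg(pl.col("value").mean())

print(pandas_result.head())
print(polars_result.head())
```

Whether a successor language or a faster library ultimately wins, the point stands: the preprocessing bottleneck is an implementation detail, not a law of nature.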
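Second, a deliberately tiny sketch of the "what-if" logic behind a predictive social simulator, using a toy demographic projection. All rates and the policy effect are invented numbers; a real simulator would calibrate thousands of interacting variables against observed data:

```python
# A hedged what-if sketch: project a toy population under two scenarios.
import numpy as np

def simulate_population(years: int, birth_rate: float, death_rate: float,
                        net_migration: float) -> np.ndarray:
    """Project a population forward under fixed annual rates."""
    pop = np.empty(years + 1)
    pop[0] = 1_000_000  # invented starting population
    for t in range(years):
        pop[t + 1] = pop[t] * (1 + birth_rate - death_rate) + net_migration
    return pop

# What-if analysis: baseline vs. a policy assumed to raise the birth rate.
baseline = simulate_population(30, birth_rate=0.008, death_rate=0.011,
                               net_migration=2_000)
with_policy = simulate_population(30, birth_rate=0.011, death_rate=0.011,
                                  net_migration=2_000)

print(f"Population after 30 years, baseline:    {baseline[-1]:,.0f}")
print(f"Population after 30 years, with policy: {with_policy[-1]:,.0f}")
```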
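Third, the multi-objective optimization point (and the remark above about loss-function design) can be illustrated with weighted-sum scalarization, one of the simplest multi-objective techniques. Both objectives and the trade-off weight are invented for illustration:

```python
# A hedged sketch: trade off two toy policy objectives with scipy.
import numpy as np
from scipy.optimize import minimize

def economic_output(x: np.ndarray) -> float:
    """Toy benefit: diminishing returns on investment in each sector."""
    return float(np.sum(np.sqrt(x)))

def emissions(x: np.ndarray) -> float:
    """Toy cost: emissions grow superlinearly with activity."""
    return float(np.sum(x ** 2))

def scalarized(x: np.ndarray, w: float = 0.7) -> float:
    # The weight w encodes the policy trade-off; choosing it is exactly
    # the loss-function design problem discussed above.
    return -w * economic_output(x) + (1 - w) * emissions(x)

x0 = np.ones(3)  # investment levels in three hypothetical sectors
res = minimize(scalarized, x0, bounds=[(1e-3, 10.0)] * 3)
print(res.x, economic_output(res.x), emissions(res.x))
```

Sweeping w from 0 to 1 traces out a Pareto frontier of candidate policies, and that is where value judgments about fairness and sustainability re-enter the picture.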

3.2 Change in the Role of Data Scientists

The role of data scientists is expected to dramatically evolve from mere technicians to designers and managers of entire societal systems. To adapt to this change, data scientists will need not only technical skills but also organizational coordination, ethical judgment, an understanding of social systems, and the ability to integrate interdisciplinary knowledge. They will also need skills in modeling and systematizing fragmented knowledge in a comprehensive way. On a societal level, continuous discussions and adjustments will be necessary regarding the coexistence of AI and humans and the impact of technological advancements. Key points to pay attention to include:

  1. Role Advancement: Expanding from data analysis to the design and management of societal systems.
  2. Diversification of Expertise: In addition to technical skills, interdisciplinary knowledge in ethics, social sciences, and risk management will become increasingly important.
  3. Importance of Adaptability: Flexibility to respond to rapidly changing technologies and societal needs will be crucial.
  4. Demand for Leadership: The ability to lead cross-functional and complex projects or interdisciplinary teams will be highly valued.

Naturally, these abilities will not be equally required for all data scientists, and specialization will emerge depending on the domain. Companies will need to correctly assess the capabilities of their workforce and make optimal personnel allocations accordingly.
