Advancements in NLP: Models, Tools, and Research Highlights
This edition delves into themes such as the emergence of larger language models, advancements in 3D deep learning, multilingual question answering benchmarks, and the auditing of AI systems.
Note: You can now subscribe to the NLP Newsletter to receive future issues directly in your inbox.
Publications
Turing-NLG: A 17-billion-parameter Language Model by Microsoft
Turing Natural Language Generation (T-NLG) is a 17-billion-parameter language model developed by researchers at Microsoft AI. The largest language model released to date, T-NLG is a 78-layer Transformer that surpasses the previous state of the art, NVIDIA’s Megatron-LM, on WikiText-103 perplexity. The model was evaluated on tasks such as question answering and abstractive summarization, showcasing capabilities like zero-shot question answering and learning with less supervision. Its development was made possible by a training optimization library called DeepSpeed with ZeRO, which is highlighted later in this newsletter.
Neural-Based Dependency Parsing
Miryam de Lhoneux has released her Ph.D. thesis titled “Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages.” This research explores neural methods for dependency parsing across various languages that express meaning in structurally unique ways. The thesis suggests that recurrent neural networks (RNNs) and recursive layers can enhance parsers by incorporating essential linguistic knowledge. Additional strategies discussed include polyglot parsing and shared parameters across related and distinct languages.
End-to-End Cloud-Based Information Extraction with BERT
A research team has published findings demonstrating how Transformer models like BERT facilitate end-to-end information extraction from domain-specific business documents, including regulatory filings and property lease agreements. This approach not only optimizes business operations but also highlights the effectiveness of BERT-based models in scenarios with minimal annotated data. The paper outlines a cloud-based application and its implementation details.
Question Answering Benchmark
Wolfson et al. (2020) introduced a question understanding benchmark, along with a methodology for decomposing a question into the reasoning steps needed to arrive at its answer. The decomposition steps are annotated via crowdsourcing, and the approach has been used to improve open-domain question answering on the HotPotQA dataset.
Radioactive Data: Tracing Through Training
Facebook AI researchers have unveiled an intriguing study focused on marking images, termed "radioactive data," to confirm if such datasets were utilized for training machine learning models. They discovered that clever markers can direct features in ways that allow models to recognize radioactive data, even when only a small percentage of the training set is radioactive. The challenge lies in ensuring that any alterations do not compromise model accuracy. The authors assert that this research could aid engineers in tracking dataset usage, thereby improving understanding of how different datasets influence neural network performance.
REALM: Retrieval-Augmented Language Model Pre-Training
REALM is a large-scale retrieval-based approach that uses a corpus of textual knowledge to pre-train a language model in an unsupervised manner. The goal is to capture knowledge in a more interpretable and modular way by having the model retrieve and attend over documents from the corpus during pre-training, with the retriever trained end-to-end via backpropagation. Evaluated on open-domain question answering benchmarks, REALM improves accuracy while offering benefits in modularity and interpretability.
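At a high level, a prediction marginalizes over the retrieved documents, which is what lets gradients flow back into the retriever. A simplified form of the objective described in the paper:

```latex
% Simplified form of the REALM objective: predictions marginalize over
% documents z retrieved from the knowledge corpus Z, so the retriever
% p(z | x) receives gradients through ordinary backpropagation.
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x)
```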
Creativity and Society
Allowing Remote Paper & Poster Presentations at Scientific Conferences
Recently, a petition has been circulated advocating for remote presentations at scientific conferences, particularly in machine learning. Notably, deep learning pioneer Yoshua Bengio has urged individuals to support this petition through his blog.
Abstraction and Reasoning Challenge
François Chollet launched a Kaggle competition featuring the Abstraction and Reasoning Corpus (ARC), which aims to inspire users to develop AI systems capable of tackling reasoning tasks they have not encountered before. The objective is to foster the creation of more resilient AI systems that can autonomously solve new problems, potentially enhancing applications in challenging real-world scenarios, such as self-driving vehicles.
ML and NLP Publications in 2019
Marek Rei has published his annual analysis of machine learning and NLP publication statistics for 2019, covering conferences like ACL, EMNLP, NAACL, and more.
Growing Neural Cellular Automata
Inspired by the self-organization processes seen in organisms like salamanders, researchers have introduced a paper titled “Growing Neural Cellular Automata,” which utilizes a differentiable model for morphogenesis aimed at replicating the behaviors of self-repairing systems. This research seeks to develop machines capable of self-repair with biological robustness and adaptability, benefitting fields like regenerative medicine and biological modeling.
Visualizing Transformer Attention
Hendrik Strobelt shared an excellent repository that enables the rapid creation of an interactive Transformer attention visualization using the Hugging Face library and d3.js.
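If you want to inspect attention weights yourself before wiring up a visualization, here is a minimal sketch (not the repository's own code, and assuming a recent version of the transformers library) that pulls per-layer attention matrices out of a Hugging Face model:

```python
# Minimal sketch: extract Transformer attention weights with Hugging Face
# transformers (illustrative only; not code from the visualization repository).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
print(len(outputs.attentions), outputs.attentions[0].shape)
```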
SketchTransfer: A New Task for Exploring Detail-Invariance and the Abstractions Learned by Deep Networks
SketchTransfer introduces a task designed to evaluate deep neural networks' ability to maintain invariance despite variations in detail. The paper discusses the "detail-invariance" issue and provides a dataset for researchers to examine this phenomenon through unlabeled sketches and labeled real images.
Tools and Datasets
DeepSpeed + ZeRO
Microsoft has released DeepSpeed, an open-source training optimization library compatible with PyTorch, capable of training models with up to 100 billion parameters. Focused on scalability, speed, cost, and usability, DeepSpeed is accompanied by ZeRO, a memory optimization technology that enhances large-scale distributed deep learning on current GPU architectures, improving throughput by three to five times over existing systems. ZeRO enables the training of models of any size that can fit within the combined available memory.
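As a rough sketch of how DeepSpeed wraps an existing PyTorch model (assuming a recent DeepSpeed release and typically launched with the deepspeed launcher; the tiny model and config values below are placeholders, not recommended settings):

```python
# Rough sketch of a DeepSpeed training step (illustrative placeholders only).
import torch
import deepspeed

# A tiny stand-in model purely for illustration.
model = torch.nn.Linear(10, 1)

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(8, 10).to(model_engine.device)
y = torch.randn(8, 1).to(model_engine.device)

loss = torch.nn.functional.mse_loss(model_engine(x), y)
model_engine.backward(loss)  # DeepSpeed handles scaling/accumulation internally
model_engine.step()
```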
A Library for Conducting Fast and Efficient 3D Deep Learning Research
PyTorch3D is an open-source toolkit designed to support 3D deep learning research. This library provides optimized implementations of commonly used 3D operators and loss functions, as well as a modular differentiable renderer for research involving complex 3D inputs, supporting high-quality 3D predictions.
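As a small illustration of the kind of operator the library provides, here is a sketch computing the chamfer distance loss between two random point clouds:

```python
# Sketch: chamfer distance between two random point clouds with PyTorch3D.
import torch
from pytorch3d.loss import chamfer_distance

# Two batches of point clouds: (batch, num_points, 3)
pc1 = torch.randn(4, 1000, 3)
pc2 = torch.randn(4, 1200, 3)

loss, _ = chamfer_distance(pc1, pc2)  # returns (distance, normals term)
print(loss)
```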
Managing Configuration of ML Projects
Hydra is a Python-based configuration management tool that streamlines the handling of complex ML projects. It lets researchers, including those working in PyTorch, reuse the same configurations across experiments. Its main advantage is allowing programmers to compose configurations as easily as they compose code, and to override them easily from the command line. Hydra also manages the working directory for a project's outputs, which is particularly useful for saving and accessing the results of multiple experiments.
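A minimal sketch of Hydra-style composition (assuming a recent Hydra release; the config directory and file names here are made up for illustration and would need to exist on disk):

```python
# train.py -- minimal Hydra sketch (expects a conf/config.yaml to exist).
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def train(cfg: DictConfig) -> None:
    # cfg is composed from conf/config.yaml plus any command-line overrides,
    # e.g. `python train.py optimizer.lr=0.01`.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    train()
```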
A Toolkit for Causal Inferencing with Bayesian Networks
CausalNex is a toolkit for causal inference using Bayesian networks, combining machine learning with causal reasoning to unveil structural relationships in data. The authors have also provided a beginner's guide on inferring causation with Bayesian networks utilizing the proposed Python library.
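A hedged sketch of what structure learning looks like with the library (the column names and data are invented for illustration; see the official tutorial for end-to-end usage):

```python
# Sketch: learn a Bayesian-network structure from tabular data with CausalNex.
# Column names and data are invented for illustration.
import numpy as np
import pandas as pd
from causalnex.structure.notears import from_pandas

rng = np.random.default_rng(0)
df = pd.DataFrame({"study_hours": rng.normal(5, 2, 500)})
df["exam_score"] = 10 * df["study_hours"] + rng.normal(0, 5, 500)

# NOTEARS-based structure learning returns a StructureModel (a networkx graph).
sm = from_pandas(df)
print(sm.edges(data="weight"))
```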
Google Colab Pro is Now Available
Google Colab has launched a Pro edition that provides benefits such as access to faster GPUs and TPUs, extended runtimes, and increased memory.
TyDi QA: A Multilingual Question Answering Benchmark
Google AI has introduced TyDi QA, a multilingual dataset aimed at encouraging researchers to tackle question answering in a variety of typologically diverse languages. The goal is to inspire the development of robust models for languages such as Arabic, Bengali, Korean, Russian, Telugu, and Thai, ultimately generalizing to an even broader range of languages.
Question Answering for Node.js
Hugging Face has unveiled a question-answering library based on DistilBERT, simplifying the integration of NLP into production using Node.js with minimal code. This model leverages Hugging Face's efficient Tokenizers implementation and TensorFlow.js for machine learning applications in JavaScript.
Ethics in AI
Identifying Subjective Bias in Text
In a recent podcast episode, researcher Diyi Yang discusses how AI can assist in detecting subjective bias in textual content. This area of research is critical for understanding how media consumption can be influenced by biased framing, emphasizing the importance of automated bias identification to enhance consumer awareness.
Artificial Intelligence, Values, and Alignment
The relationship between AI systems and human values is a prominent research focus. DeepMind has published a paper that explores philosophical questions regarding AI alignment, addressing both technical aspects (how to encode values for reliable AI outcomes) and normative considerations (which principles to encode). The paper advocates for a principle-based approach to ensure fairness across differing beliefs.
On Auditing AI Systems
VentureBeat has reported on a new framework named SMACTR, developed by Google researchers in conjunction with other groups, that enables engineers to audit AI systems. This initiative aims to bridge the accountability gap present in current AI technologies deployed for consumer use.
Articles and Blog Posts
On Model Distillation for NLP Systems
In a recent episode of the NLP Highlights podcast, Thomas Wolf and Victor Sanh discuss model distillation, a method for compressing large models like BERT into smaller ones suitable for scalable real-world NLP applications. Their approach, DistilBERT, builds a smaller student model that mimics the output distribution of a larger teacher model.
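The core idea, sketched generically below (this is the standard distillation loss, not DistilBERT's exact training recipe), is to train the student to match a softened version of the teacher's output distribution:

```python
# Generic knowledge-distillation loss sketch (not DistilBERT's exact recipe):
# the student is trained to match the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 as in Hinton et al.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (t * t)

student_logits = torch.randn(8, 30522)  # e.g. vocabulary-sized logits
teacher_logits = torch.randn(8, 30522)
print(distillation_loss(student_logits, teacher_logits))
```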
BERT, ELMo, & GPT-2: Contextualized Word Representations
Kawin Ethayarajh has published a post analyzing the contextual effectiveness of models like BERT, ELMo, and GPT-2. The discussion covers various topics, including contextuality measures, context-specificity, and comparisons between static embeddings and contextualized representations.
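To get a feel for what "contextuality" means in practice, one can embed the same word in different contexts and compare the resulting vectors. A rough sketch (not the post's code, assuming a recent transformers version) using BERT:

```python
# Rough sketch: cosine similarity between contextualized representations of
# the same word ("bank") in two different contexts, using BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(word)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[idx]

v1 = word_vector("I deposited money at the bank.", "bank")
v2 = word_vector("We sat on the bank of the river.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))
```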
Sparsity in Neural Networks
ML researcher François Lagunas explores the potential of using sparse tensors in neural network models, motivated by the impractical size and slow speed of modern models, particularly Transformers. The post also discusses the challenges of implementing efficient sparsity on GPUs.
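For context, PyTorch already ships basic sparse tensor support; a toy sketch of a sparse matrix-vector product (illustrative only, unrelated to the blog post's code):

```python
# Toy sketch: a sparse matrix-vector product with PyTorch's COO sparse tensors.
import torch

indices = torch.tensor([[0, 1, 2], [2, 0, 1]])  # (row, col) coordinates
values = torch.tensor([3.0, 4.0, 5.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

dense_vec = torch.randn(3, 1)
print(torch.sparse.mm(sparse, dense_vec))
```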
Training Your Own Language Model
Hugging Face has released a comprehensive tutorial on training a language model from scratch, utilizing their own Transformers and Tokenizers libraries.
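A compressed sketch of the first step in that tutorial, training a byte-level BPE tokenizer from scratch (assuming a recent tokenizers release; "data.txt" and the output directory are placeholders):

```python
# Sketch: train a byte-level BPE tokenizer from scratch with the tokenizers
# library ("data.txt" is a placeholder path; "my_tokenizer" must exist).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("my_tokenizer")  # writes vocab.json and merges.txt

print(tokenizer.encode("Hello, world!").tokens)
```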
Tokenizers: How Machines Read
Cathal Horan has published a detailed blog post discussing the various types of tokenizers employed by contemporary NLP models, emphasizing the significance of tokenization as a key research area. The post includes guidance on training custom tokenizers using methods like SentencePiece and WordPiece.
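For instance, training a small SentencePiece model takes only a few lines (the input path below is a placeholder; the corpus must contain enough text to support the requested vocabulary size):

```python
# Sketch: train and use a SentencePiece model ("data.txt" is a placeholder).
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=data.txt --model_prefix=spm_demo --vocab_size=8000"
)

sp = spm.SentencePieceProcessor()
sp.Load("spm_demo.model")
print(sp.EncodeAsPieces("Tokenization determines what a model can see."))
```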
Education
Machine Learning at VU University Amsterdam
The 2020 MLVU machine learning course is now available online, featuring a complete set of slides, videos, and syllabus. While it provides an introduction to ML, it also covers advanced topics such as VAEs and GANs.
Math Resources for ML
Suzana Ilić and Machine Learning Tokyo (MLT) have created a repository of free online resources aimed at democratizing ML education, focusing on the foundational mathematics necessary for understanding ML concepts.
Introduction to Deep Learning
Keep up with the “Introduction to Deep Learning” course from MIT, which releases new lectures weekly, along with all materials and coding labs.
Deep Learning with PyTorch
Alfredo Canziani has shared slides and notebooks from his minicourse on Deep Learning with PyTorch, along with a companion website detailing the concepts taught.
Missing Semester of Your CS
The “Missing Semester of Your CS” offers valuable material for data scientists with non-technical backgrounds, covering topics like shell tools, scripting, and version control, created by instructors at MIT.
Advanced Deep Learning
CMU has published the syllabus and slides for its “Advanced Deep Learning” course, covering topics such as autoregressive models, generative models, and self-supervised/predictive learning, tailored for MS or Ph.D. students with a solid ML background.
Noteworthy Mentions
You can access the previous NLP Newsletter here, featuring discussions on enhancing conversational agents, the release of language-specific BERT models, free datasets, and new deep learning libraries.
Xu et al. (2020) proposed a method for compressing BERT by dividing the model into its original modules and progressively replacing them with more compact substitutes. The approach mixes original and compacted modules during training and outperforms other knowledge distillation techniques on the GLUE benchmark.
Another intriguing course titled “Introduction to Machine Learning” covers foundational topics in ML, including supervised regression and parameter tuning.
The Greek BERT model (GreekBERT) is now available through the Hugging Face Transformers library.
Jeremy Howard has published a paper detailing the fastai deep learning library, widely utilized for research and educational purposes in deep learning. This resource is highly recommended for developers involved in building and enhancing deep learning and ML frameworks.
Deeplearning.ai has completed the release of its TensorFlow: Data and Deployment Specialization, aiming to educate developers on efficient model deployment and innovative data utilization.
Sebastian Raschka has released a comprehensive review paper titled “Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence,” providing valuable insights into the landscape of machine learning tools.
You can also find the NLP Newsletter on our official website at dair.ai and subscribe there to receive it directly in your inbox.