Creating a fake news detector with Hugging Face: a first shot

I finished the Natural Language Processing Specialization on Coursera, so I wanted a small-scope task where I could try out some of the models it covered. Fake news detection seems to have real use cases nowadays. I would like to experiment with other languages as well, but first, let's see how it goes in English.

What did I get from the NLP specialization?

I had already worked with the word2vec model and was familiar with LSTMs and other recurrent networks, so the first three courses were a kind of recap for me, but I really liked the way they were put together. They helped me understand the statistical motivations behind word2vec and the LSTM-based seq2seq models. The last course was fascinating. It guided me through attention and transformers and introduced the current state-of-the-art NLP models, like BERT, T5, GPT-2 and GPT-3.

What is Hugging Face?

Understanding a paper doesn't mean that you can easily reproduce its results, especially in machine learning. There can be many reasons for this: you can't access the original dataset, you don't have enough computational resources, the small implementation tricks aren't in the paper, etc. Hugging Face helps a lot with these details. It is an open-source toolset for building NLP models quickly and efficiently. You can download complex pre-trained models together with their original tokenizers, and you get access to many valuable NLP datasets.

What was my initial goal?

There are articles that you read and shout "oh what a stupid fake bullshit". So the idea is simple: create a model that can decide whether an article is real or fake. Sounds easy, just a binary classification... until you ask yourself what fake news means. How can I label my data, and how can I verify the labels if I use someone else's dataset? The two sides of politics apply the term "fake news" to different "facts". What can I be sure about? To calm myself down, I could say: science. But even if an experiment gives the same result n times, can I be sure that I'm going to get the same result the next time? Okay, then there is math. I can be pretty sure about math. True or false. Math works. What is true must be true. It follows from the axioms. And the axioms worked until Gödel asked: do they?

Okay, we got too far. I just wanted to try the fancy NLP models. The fake news topic has big potential, but there are challenges in building a fair enough service. First I need a feedback loop so that I can iterate on my solution. So the first milestone is just getting a dataset with news and labels, doing transfer learning with a pre-trained model, and calculating the accuracy. This relaxed version of my initial goal postpones the definition of correctness and requires only first experiments with the Hugging Face framework. It is only a prototype of a code base that can run training and evaluation on a custom dataset.

What dataset did I use?

I chose the top-ranked fake news dataset on Kaggle. It labels the data based on the publishers. Is that ethically sound? Perhaps not, but I think that most people, including me, use the same heuristic on the internet. Because I was focusing on the pipeline, I just read in True.csv and Fake.csv with pandas and printed out some examples; I didn't put much effort into exploring the data.

import pandas as pd

fake_news = pd.read_csv("Fake.csv")
real_news = pd.read_csv("True.csv")

def print_news(article, print_max_line_length=20):
    print(f"title: {article['title']}")
    print(f"date: {article['date']}")
    print(f"subject: {article['subject']}")
    print("text:")
    for word_count, word in enumerate(article['text'].split()):
        if word_count % print_max_line_length == 0:
            print("")
        print(word, end=" ")

Sample from the fake news

title: Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing
date: December 31, 2017
subject: News
text:

Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and the very dishonest fake news media. The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year, President Angry Pants tweeted. 2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America! Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year! Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress. Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me? Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish?? Marlene (@marlene399) December 31, 2017You can t just say happy new year? Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love! Donald J. Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. 
He s been doing this for years.Trump has directed messages to his enemies and haters for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President? Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down. Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters? Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images.

Concerns about the dataset

There is a good discussion about the concerns with this dataset. For example, the publisher's name appears at the beginning of each real news article, which makes the classification trivial. So I added a simple preprocessing step to the data loading that deletes that part from the beginning of the text.

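def preprocess_text(example):
    text = example["text"]
    original_len = len(text)
    # The publisher prefix is assumed to end at the first " -" in the text.
    splits = text.split(" -", maxsplit=1)
    text_without_publisher = splits[1] if len(splits) > 1 else splits[0]
    new_len = len(text_without_publisher)
    # Only drop the prefix if fewer than 35 characters were removed, so that
    # a " -" occurring later in the body of an article is left alone.
    if original_len - new_len < 35:
        example["text"] = text_without_publisher
    return example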
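
To see what this preprocessing does, here is a small self-contained check. The function is repeated so that the snippet runs on its own, and the three sample texts are made up for illustration; they are not taken from the dataset.

```python
def preprocess_text(example):
    # Same logic as above, repeated so that this snippet is self-contained.
    text = example["text"]
    original_len = len(text)
    splits = text.split(" -", maxsplit=1)
    text_without_publisher = splits[1] if len(splits) > 1 else splits[0]
    new_len = len(text_without_publisher)
    if original_len - new_len < 35:
        example["text"] = text_without_publisher
    return example


# Hypothetical examples, invented for this illustration.
with_prefix = {"text": "WASHINGTON (Reuters) - Lawmakers met on Monday."}
no_dash = {"text": "Lawmakers met on Monday."}
late_dash = {"text": "Lawmakers met on Monday to discuss the new budget - again."}

stripped = preprocess_text(with_prefix)["text"]
unchanged_no_dash = preprocess_text(no_dash)["text"]
unchanged_late_dash = preprocess_text(late_dash)["text"]

print(repr(stripped))             # the short prefix is removed
print(repr(unchanged_no_dash))    # nothing to split on, text is kept
print(repr(unchanged_late_dash))  # the " -" comes too late, text is kept
```

Note that the stripped text keeps a leading space, because the split removes only the " -" separator; calling .lstrip() on the result would remove it.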

How did I use Hugging Face?

The biggest challenge for me was wrapping the custom dataset in a Hugging Face dataset. I followed their tutorial, which makes many things easy, but it uses a dataset that is already integrated into Hugging Face. I naively tried to convert the pandas DataFrame to a native Python object and use that for initialization... Anyway, I was wrong; it is much simpler, as you can see below. I had to merge the two DataFrame objects, one for fake news and one for real news, into one DataFrame with a "label" column, and then I could create a Dataset object by calling the datasets.Dataset.from_pandas function.

from datasets import Dataset

def convert_to_dataset_format(fake_news, real_news, fake_news_label=1, real_news_label=0):
    fake_news["label"] = fake_news_label
    real_news["label"] = real_news_label
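    # ignore_index=True gives the merged frame a fresh 0..n-1 index instead of repeated row labels
    all_data = pd.concat((fake_news, real_news), ignore_index=True)[["text", "label"]]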
    dataset = Dataset.from_pandas(all_data)
    return dataset

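def dummy_dataset_split(data_frame, ratio=None):
    if ratio is None:
        ratio = {"train": 0.7, "test": 0.3}
    index_to_split = int(len(data_frame) * ratio["train"])
    # iloc slices by position and excludes the end, so no row lands in both parts
    # (label-based loc includes the end label, which would duplicate the boundary row)
    return data_frame.iloc[:index_to_split].copy(), data_frame.iloc[index_to_split:].copy()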
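
The split function above simply cuts each file at 70% in its original row order. If the rows happen to be sorted (by date or subject, for example), a shuffled split may be preferable. The following is only a sketch of that alternative, not what the notebook does; shuffled_split_indices and its seed parameter are hypothetical names introduced here.

```python
import random


def shuffled_split_indices(n_rows, train_ratio=0.7, seed=0):
    """Return (train_positions, test_positions) for a reproducible shuffled split."""
    positions = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(positions)
    index_to_split = int(n_rows * train_ratio)
    return positions[:index_to_split], positions[index_to_split:]


train_positions, test_positions = shuffled_split_indices(100)
print(len(train_positions), len(test_positions))  # 70 30
```

With pandas, the two position lists could then be passed to data_frame.iloc[...] to select the corresponding rows.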

split_data_frame = dummy_dataset_split

fake_news_train, fake_news_test = split_data_frame(fake_news)
real_news_train, real_news_test = split_data_frame(real_news)
dataset = {"train": convert_to_dataset_format(fake_news_train, real_news_train),
           "test": convert_to_dataset_format(fake_news_test, real_news_test)}

Here comes the first cool feature of Hugging Face. Tokenization in NLP relies on several heuristics and intuitions: what kind of words to throw out, how to handle word endings (which is so important in most non-English languages), etc. With Hugging Face, we get the same tokenizer that was used when the original model was trained.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
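
The padding="max_length" and truncation=True arguments above make every example the same length: longer inputs are cut and shorter ones are filled with a padding token. The toy sketch below illustrates the idea on whitespace-separated words. It is not the real BERT tokenizer (which splits text into WordPiece subwords and adds special tokens), and pad_or_truncate is a hypothetical helper written for this illustration.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
    """Return (tokens, attention_mask), both exactly max_length long."""
    kept = tokens[:max_length]                   # truncation
    n_padding = max_length - len(kept)
    padded = kept + [pad_token] * n_padding      # padding up to max_length
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_padding
    return padded, attention_mask


short_tokens, short_mask = pad_or_truncate("fake news detector".split(), max_length=5)
long_tokens, long_mask = pad_or_truncate("a b c d e f g".split(), max_length=5)

print(short_tokens, short_mask)
print(long_tokens, long_mask)
```

The real tokenizer returns the analogous fields input_ids and attention_mask; for bert-base-cased the maximum length is 512 tokens.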

dataset = {"train": dataset["train"].map(preprocess_text, batched=False),
           "test": dataset["test"].map(preprocess_text, batched=False)}
print(dataset["train"][3000])
tokenized_datasets = {"train": dataset["train"].map(tokenize_function, batched=True),
                      "test": dataset["test"].map(tokenize_function, batched=True)}

Results

Things are really straightforward from here, because I just followed the tutorial mentioned above. You can rerun my notebook if you are interested in the whole code. I trained and evaluated on a small subset, just to see that the code runs. The test accuracy was around 0.98 in both cases: with the publisher-removal preprocessing and without it. Does this say anything about the model's performance? Nope. We have only wrapped a custom dataset so that it can now be used within the Hugging Face framework. Which is fine. We have a basis on which to start building a fake news detector that is meaningful under some circumstances. We will see the necessary conditions later.
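
For reference, the accuracy reported above is just the fraction of examples whose predicted label matches the true label. A minimal sketch of that computation follows; it assumes the model outputs one score (logit) per class for each example, and accuracy_from_logits is a name made up for this illustration rather than something taken from the notebook.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    correct = 0
    for row, label in zip(logits, labels):
        # argmax: index of the largest score in this row
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up scores for four articles; column 0 = real, column 1 = fake.
example_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
example_labels = [0, 1, 1, 1]
print(accuracy_from_logits(example_logits, example_labels))  # 0.75
```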