Transfer labels to Label Studio¶
In this example, we will transfer the labels Fonduer found to Label Studio for manual evaluation and correction.
Before any labels are annotated, please ensure that the document representations in Fonduer and Label Studio are the same. Otherwise, the labels might not be transferable! See example_document_converter for further information.
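One quick way to check that both systems see identical documents is to compare the raw bytes of the files each of them reads. This is a minimal sketch, not part of LabelstudioToFonduer; the helper names file_digest and diverging_documents are hypothetical:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def diverging_documents(dir_a: str, dir_b: str) -> list:
    """List filenames present in both directories whose contents differ.

    A non-empty result means the two systems would work on different
    document representations, so labels may not transfer cleanly.
    """
    a, b = Path(dir_a), Path(dir_b)
    shared = {p.name for p in a.iterdir()} & {p.name for p in b.iterdir()}
    return sorted(
        name for name in shared
        if file_digest(a / name) != file_digest(b / name)
    )
```

Comparing digests only catches byte-level differences; it will not tell you where a document diverges, but it is a cheap first check before running the full pipeline.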
Fonduer setup:¶
The way Fonduer is set up can influence whether labels are transferable between the systems. Therefore, Fonduer has to be configured so that it does not modify the documents.
import os
project_name = "mails_sm"
conn_string = "postgresql://postgres:postgres@127.0.0.1:5432/"
dataset_path = "data/mails"
export_path = os.path.join(dataset_path, "export.json")
documents_path = os.path.join(dataset_path, "documents")
from LabelstudioToFonduer.to_fonduer import parse_export
export = parse_export(export_path)
Create the fonduer project¶
After that, we create the project in Fonduer:
from LabelstudioToFonduer.fonduer_tools import save_create_project
save_create_project(conn_string=conn_string, project_name=project_name)
from fonduer import Meta, init_logging
init_logging(log_dir=os.path.join(dataset_path, "logs"))
session = Meta.init(conn_string + project_name).Session()
[2022-11-17 12:48:12,270][INFO] fonduer.meta:53 - Logging was already initialized to use data/mails/logs/2022-11-17_12-47-34. To configure logging manually, call fonduer.init_logging before initialiting Meta.
[2022-11-17 12:48:12,271][INFO] fonduer.meta:135 - Connecting user:postgres to 127.0.0.1:5432/mails_sm
[2022-11-17 12:48:12,271][INFO] fonduer.meta:162 - Initializing the storage schema
Fonduer might read the documents with the wrong encoding, which causes errors. To avoid this, a dedicated HTMLDocPreprocessor can be used. LabelStudio_to_Fonduer provides a slightly modified HTMLDocPreprocessor as a starting point, named My_HTMLDocPreprocessor. The processor can be imported like this:
from LabelstudioToFonduer.document_processor import My_HTMLDocPreprocessor
from fonduer.parser import Parser
doc_preprocessor = My_HTMLDocPreprocessor(documents_path, max_docs=100)
Setup lingual parser¶
By default, Fonduer uses a lingual parser that splits sentences based on spaCy's split_sentences function. While this method generally performs quite well, it does not handle abbreviations and special punctuation well. If our labels contain such punctuation or abbreviations, we need to use a modified lingual_parser.
LabelStudio_to_Fonduer comes with a modified version that splits sentences only on the . character and can handle a given list of exceptions. To add exceptions and use this ModifiedSpacyParser, we can use this code:
from LabelstudioToFonduer.lingual_parser import ModifiedSpacyParser
exceptions = [".NET", "Sr.", ".WEB", ".de", "Jr.", "Inc.", "Senior.", "p.", "m."]
my_parser = ModifiedSpacyParser(lang="en", split_exceptions=exceptions)
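To get a feel for what exception-aware splitting does, here is a pure-Python sketch of the idea. This is an illustration only, not the actual ModifiedSpacyParser implementation; it simply protects the dots inside known exceptions before splitting:

```python
def split_sentences(text: str, exceptions: list) -> list:
    """Split text on '.' only, but never inside a known exception
    such as 'Jr.' or '.NET'. Simplified stand-in to illustrate
    the exception-handling idea, not the real parser."""
    # Mask the dots inside every exception with a placeholder byte
    # so the subsequent split cannot break on them.
    masked = text
    for exc in exceptions:
        masked = masked.replace(exc, exc.replace(".", "\x00"))
    parts = [p.strip() for p in masked.split(".") if p.strip()]
    # Restore the protected dots.
    return [p.replace("\x00", ".") for p in parts]
```

With exceptions like "Jr." and ".NET", a text such as "John Smith Jr. joined Inc. in 2020. He works on .NET." stays together as two sentences instead of being shredded at every period.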
Import documents¶
Once the pipeline is set up, we can import our documents.
corpus_parser = Parser(session, lingual_parser=my_parser, structural=True, lingual=True, flatten=[])
corpus_parser.apply(doc_preprocessor, parallelism=8)
from fonduer.parser.models import Document, Sentence
print(f"Documents: {session.query(Document).count()}")
print(f"Sentences: {session.query(Sentence).count()}")
docs = session.query(Document).order_by(Document.name).all()
[2022-11-17 12:48:13,670][INFO] fonduer.utils.udf:67 - Running UDF...
Documents: 10
Sentences: 1336
Setup Fonduer datamodel¶
In this step, the data model is created and then used to create the labeling functions and so on. For further information, please refer to the Fonduer documentation.
As we already have some labeled data, we can derive some starting values to create the Fonduer data model. This configuration is highly dependent on the data we have.
It might be beneficial to test the pipeline in advance to make sure Fonduer does not change any document and that all annotated spans can be detected. Therefore, we will not spend much time on labeling functions yet and only set up some rudimentary Fonduer processing for now. Once we are sure the pipeline works for our data, we will come back to this.
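Such an advance test can be as simple as checking that every annotated span still occurs verbatim in the parsed sentences. The helper below is a hypothetical sketch, not part of either library:

```python
def find_missing_spans(annotated_spans, sentences):
    """Return labelled spans that do not occur verbatim inside any
    parsed sentence string. A non-empty result hints that Fonduer
    altered or re-split the document, so those labels would not
    transfer cleanly."""
    return [
        span for span in annotated_spans
        if not any(span in sentence for sentence in sentences)
    ]
```

In practice, annotated_spans would come from the Label Studio export and sentences from querying the Sentence table after parsing.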
from fonduer.candidates.models import mention_subclass
Title = mention_subclass("Title")
Date = mention_subclass("Date")
from fonduer.candidates import MentionNgrams
title_ngrams = MentionNgrams(n_max=23, n_min=5)
date_ngrams = MentionNgrams(n_max=13, n_min=3)
print("Title ngram size:", title_ngrams.n_max)
print("Date ngram size:", date_ngrams.n_max)
title = export.lable_entitis("Title")
date = export.lable_entitis("Date")
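The n_min/n_max values above were picked by hand. Since we already have labelled spans, one way to derive such bounds is to measure span lengths in whitespace tokens. This is a rough heuristic sketch; the ngram_bounds helper is hypothetical and not part of either library:

```python
def ngram_bounds(spans, margin=0):
    """Derive MentionNgrams-style n_min/n_max from already-labelled
    spans by measuring their lengths in whitespace tokens, with an
    optional safety margin on both ends."""
    lengths = [len(span.split()) for span in spans]
    return max(min(lengths) - margin, 1), max(lengths) + margin
```

Applied to the title and date spans from the export, this would suggest starting bounds that are then widened or tightened as the data demands.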
from fonduer.candidates.matchers import LambdaFunctionMatcher
def is_title(mention):
    if mention.get_span() in title:
        return True
    else:
        return False

def is_date(mention):
    if mention.get_span() in date:
        return True
    else:
        return False
title_matcher = LambdaFunctionMatcher(func=is_title)
date_matcher = LambdaFunctionMatcher(func=is_date)
from fonduer.candidates import MentionExtractor
mention_extractor = MentionExtractor(
session,
[Title, Date],
[title_ngrams, date_ngrams],
[title_matcher, date_matcher],
)
from fonduer.candidates.models import Mention
mention_extractor.apply(docs)
num_title = session.query(Title).count()
num_date = session.query(Date).count()
print(f"Total Mentions: {session.query(Mention).count()} ({num_title} titles, {num_date} dates)")
from fonduer.candidates.models import candidate_subclass
TitleDate = candidate_subclass("TitleDate", [Title, Date])
from fonduer.candidates import CandidateExtractor
candidate_extractor = CandidateExtractor(session, [TitleDate])
candidate_extractor.apply(docs)
[2022-11-17 12:48:21,541][INFO] fonduer.candidates.mentions:467 - Clearing table: title
[2022-11-17 12:48:21,609][INFO] fonduer.candidates.mentions:467 - Clearing table: date
[2022-11-17 12:48:21,617][INFO] fonduer.utils.udf:67 - Running UDF...
Title ngram size: 23
Date ngram size: 13
[2022-11-17 12:48:29,910][INFO] fonduer.candidates.candidates:138 - Clearing table title_date (split 0)
[2022-11-17 12:48:29,936][INFO] fonduer.utils.udf:67 - Running UDF...
Total Mentions: 24 (15 titles, 9 dates)
Create Label Studio Import¶
train_cands = candidate_extractor.get_candidates()
from LabelstudioToFonduer.to_label_studio import ToLabelStudio
converter = ToLabelStudio()
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
# export = converter.create_export(candidates=train_cands, fonduer_export_path="import.json")
export = converter.create_export(candidates=train_cands)
import json
print(json.dumps(export[0]["annotations"], indent=4))
[
    {
        "model_version": 0,
        "score": 0,
        "result": [
            {
                "from_id": 1,
                "to_id": 3,
                "type": "relation",
                "direction": "right",
                "readonly": false
            },
            {
                "id": 1,
                "from_name": "ner",
                "to_name": "text",
                "type": "hypertextlabels",
                "readonly": false,
                "hidden": false,
                "score": 0.0,
                "value": {
                    "start": "/ul[1]/li/a/span[1]",
                    "end": "/ul[1]/li/a/span[1]",
                    "startOffset": 7,
                    "endOffset": 91,
                    "text": "3-D reconstruction of skull suggests a small crocodyliform dinosaur is a new species",
                    "hypertextlabels": [
                        "Title"
                    ]
                }
            },
            {
                "id": 3,
                "from_name": "ner",
                "to_name": "text",
                "type": "hypertextlabels",
                "readonly": false,
                "hidden": false,
                "score": 0.0,
                "value": {
                    "start": "/p[9]/strong/span",
                    "end": "/p[9]/strong/span",
                    "startOffset": 15,
                    "endOffset": 46,
                    "text": "February 15, 2017 11 AM Pacific",
                    "hypertextlabels": [
                        "Date"
                    ]
                }
            },
            {
                "from_id": 2,
                "to_id": 3,
                "type": "relation",
                "direction": "right",
                "readonly": false
            },
            {
                "id": 2,
                "from_name": "ner",
                "to_name": "text",
                "type": "hypertextlabels",
                "readonly": false,
                "hidden": false,
                "score": 0.0,
                "value": {
                    "start": "/p[11]/strong/span",
                    "end": "/p[11]/strong/span",
                    "startOffset": 6,
                    "endOffset": 90,
                    "text": "3-D reconstruction of skull suggests a small crocodyliform dinosaur is a new species",
                    "hypertextlabels": [
                        "Title"
                    ]
                }
            },
            {
                "id": 3,
                "from_name": "ner",
                "to_name": "text",
                "type": "hypertextlabels",
                "readonly": false,
                "hidden": false,
                "score": 0.0,
                "value": {
                    "start": "/p[9]/strong/span",
                    "end": "/p[9]/strong/span",
                    "startOffset": 15,
                    "endOffset": 46,
                    "text": "February 15, 2017 11 AM Pacific",
                    "hypertextlabels": [
                        "Date"
                    ]
                }
            }
        ]
    }
]
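Before importing such an export into Label Studio, it can help to sanity-check its structure. The validator below is a rough sketch based only on the fields visible in the output above, not on the full Label Studio task schema:

```python
def check_export_task(task: dict) -> list:
    """Collect basic structural problems in one Label Studio task:
    relations must reference region ids defined in the same task,
    and every hypertext region needs its positional fields."""
    problems = []
    for annotation in task.get("annotations", []):
        results = annotation.get("result", [])
        ids = {r["id"] for r in results if "id" in r}
        for r in results:
            if r.get("type") == "relation":
                # A relation endpoint without a matching region id
                # would be silently dropped or rejected on import.
                for key in ("from_id", "to_id"):
                    if r.get(key) not in ids:
                        problems.append(f"relation {key} {r.get(key)} has no region")
            elif r.get("type") == "hypertextlabels":
                value = r.get("value", {})
                for key in ("start", "end", "startOffset", "endOffset", "text"):
                    if key not in value:
                        problems.append(f"region {r.get('id')} missing value.{key}")
    return problems
```

Running it over every task in the export (for example, checking that check_export_task returns an empty list for each one) catches broken relations and incomplete regions before they reach the annotation UI.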