NER_biomedical#

You can see the full contents of this project on GitHub.

<Project Name>#

<1-2 sentences describing the project>

Papers#

How the papers were obtained#

Papers are typically obtained with pubget. We recommend invoking pubget with the --query_file option and either storing a copy of the query file in the project’s directory or including a copy in the README.md.
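A minimal sketch of how the query-file recommendation could be scripted. It assumes pubget is installed and exposes a `pubget run <data_dir>` subcommand; the `--query_file` option is the one named above. The helper names (`build_pubget_command`, `archive_query`, `run_pubget`) and any paths you pass in are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_pubget_command(data_dir, query_file):
    """Return the argument list for a pubget download driven by a query file."""
    # Assumption: `pubget run <data_dir>` is the download entry point.
    # `--query_file` is the option recommended in this README.
    return ["pubget", "run", str(data_dir), "--query_file", str(query_file)]


def archive_query(query_file, project_dir):
    """Copy the query file into the project directory so the search can be repeated."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    target = project_dir / Path(query_file).name
    shutil.copyfile(query_file, target)
    return target


def run_pubget(data_dir, query_file, project_dir):
    """Archive the query, then run pubget; raises CalledProcessError on failure."""
    archive_query(query_file, project_dir)
    subprocess.run(build_pubget_command(data_dir, query_file), check=True)
```

Keeping the command construction separate from the call that runs it makes it easy to print or log the exact command alongside the archived query.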

<description>

Where the full papers are stored#

Typically on OSF. Please also add a documents/datasets.json file containing the URL where the full pubget dataset can be downloaded, that looks like:

    [
        {
            "url": "https://osf.io/download/<...>/"
        }
    ]
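As a rough illustration, the datasets.json file described above could be written and sanity-checked as follows. The function names are hypothetical, and the only layout assumed is the one stated here: a top-level list of objects, each with a "url" key.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def write_datasets_json(project_dir, urls):
    """Write documents/datasets.json as a list of {"url": ...} objects."""
    documents_dir = Path(project_dir) / "documents"
    documents_dir.mkdir(parents=True, exist_ok=True)
    target = documents_dir / "datasets.json"
    entries = [{"url": url} for url in urls]
    target.write_text(json.dumps(entries, indent=4) + "\n", encoding="utf-8")
    return target


def check_datasets_json(path):
    """Return the listed URLs, raising ValueError if the layout is unexpected."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list at the top level")
    urls = []
    for entry in entries:
        if not isinstance(entry, dict) or "url" not in entry:
            raise ValueError("each entry must be an object with a 'url' key")
        parsed = urlparse(entry["url"])
        # Only the URL's shape is checked; nothing is downloaded.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {entry['url']!r}")
        urls.append(entry["url"])
    return urls
```

The check is deliberately offline, so it can run in a pre-commit hook or CI job without network access.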

<description>

Annotations#

File(s) being annotated:#

  • /projects/<project_name>/documents/<documents_file1_name>.jsonl

    • corresponding file in the pubget output:

      • <pubget_folder_name>/subset_allArticles_labelbuddyData/<documents_file1_name>.jsonl

  • /projects/<project_name>/documents/<documents_file2_name>.jsonl

    • corresponding file in the pubget output:

      • <pubget_folder_name>/subset_allArticles_labelbuddyData/<documents_file2_name>.jsonl
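The .jsonl files listed above can be inspected with a few lines of Python. This is a sketch under two assumptions: each non-empty line holds one JSON object (the usual JSON Lines layout), and the document body is stored under a "text" key. Neither is confirmed here, so adjust the key name to match your files.

```python
import json
from pathlib import Path


def iter_documents(jsonl_path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(jsonl_path).open(encoding="utf-8") as stream:
        for line_number, line in enumerate(stream, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report which line is malformed instead of a bare parse error.
                raise ValueError(
                    f"{jsonl_path}: line {line_number}: {error}"
                ) from error


def summarize(jsonl_path):
    """Return (number of documents, total characters in the assumed 'text' field)."""
    n_docs = 0
    n_chars = 0
    for doc in iter_documents(jsonl_path):
        n_docs += 1
        n_chars += len(doc.get("text", ""))
    return n_docs, n_chars
```

Comparing the output of `summarize` for the project copy and the corresponding pubget output file is a quick way to confirm that the two files match.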

Annotation labels:#

  • <label1>: <description of label1>

  • <label2>: <description of label2>

Labels found in other projects as well:#

  • <label2>

Instructions for annotators#

<description>

Labels in this project#

NER method (9 docs)

Example annotations:

… algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing. ## Results To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-pro…
…any natural language processing system aimed at managing the wealth of biomedical information that is available electronically. To support term recognition in the biomedical domain, we have developed Termino, a large-scale terminological resource for text processing applications, which has two main components: first, a database into which very large numbers of terms can be loaded from resources such as U…
… has two main components: first, a database into which very large numbers of terms can be loaded from resources such as UMLS, and stored together with various kinds of relevant information; second, a finite state recognizer, for fast and efficient identification and mark-up of terms within text. Since many biomedical applications require this functionality, we have made Termino available to the community as a web servic…
…rmance. The proposed method includes Natural Language Processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the proposed method does not rely on any lexicon nor any dictionary in order to keep the system applicable to other NER tasks in bio-…
… free text in the domain, the proposed method does not rely on any lexicon nor any dictionary in order to keep the system applicable to other NER tasks in bio-text data. ## Results We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHE…
…ic information are matched to the tokens to give orthographic information. These baseline features are summarized in Table . The baseline features. For word representation features, we train Brown clustering models [ ] and Word Vector (WV) models [ ] on a large PubMed and PMC document collection. Brown clustering is a hierarchical word clustering method, grouping words in an input corpus to maximize the mutual …
…e tokens to give orthographic information. These baseline features are summarized in Table . The baseline features. For word representation features, we train Brown clustering models [ ] and Word Vector (WV) models [ ] on a large PubMed and PMC document collection. Brown clustering is a hierarchical word clustering method, grouping words in an input corpus to maximize the mutual information of bigrams. Therefor…
…arized in Table . The baseline features. For word representation features, we train Brown clustering models [ ] and Word Vector (WV) models [ ] on a large PubMed and PMC document collection. Brown clustering is a hierarchical word clustering method, grouping words in an input corpus to maximize the mutual information of bigrams. Therefore, the quality of a partition can be computed as a sum of mutual information weights between clusters. It run…
…quality of a partition can be computed as a sum of mutual information weights between clusters. It runs in time O(V × K ), where V is the size of the vocabulary and K is the number of clusters. The VW model is induced via a Recurrent Neural Network (RNN) and can be seen as a language model that consists of n -dimensional continuous valued vectors, each of which represents a word in the training corpus. The RNN instance is trained to predict either…
…ns of documents to process within an hour. We used a tool implemented by Mikolov et al. [ ] to build our WV model from the PubMed collection. Further, the word vectors are clustered using a K-means algorithm to drive a Word Vector Class (WVC) model. Since Brown clustering is a bigram model, this model may not be able to carry wide context information of a word, whereas the WVC model is an n -gram mode…
entity (7 docs)

Example annotations:

… into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarka…
… testing sets of CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. BANNER-CHEMDNER system is available at: . # Body ## Background As biomedical literature on servers grows …
… word classes. Finally, we apply the CRF sequence-labeling method to the extracted feature vectors to train the NER model. These steps will be described in subsequent sections. System design for chemical and drug Named Entity Recognition . The solid lines represent the flow of labeled data, and the dotted lines represent the flow of unlabeled data. ### Preprocessing Preprocessing is where te…
…. Finally, we apply the CRF sequence-labeling method to the extracted feature vectors to train the NER model. These steps will be described in subsequent sections. System design for chemical and drug Named Entity Recognition . The solid lines represent the flow of labeled data, and the dotted lines represent the flow of unlabeled data. ### Preprocessing Preprocessing is where text data i…
Xu, Kai and Zhou, Zhanfan and Gong, Tao and Hao, Tianyong and Liu, Wenyin BMC Med Inform Decis Mak, 2018 # Title SBLC: a hybrid model for disease named entity recognition based on semantic bidirectional LSTMs and conditional random fields # Keywords Biomedical informatics Text mining Machine learning Neural networks # Abstract ## Backgro…
Hemati, Wahed and Mehler, Alexander J Cheminform, 2019 # Title LSTMVoter: chemical named entity recognition using a conglomerate of sequence labeling tools # Keywords BioCreative V.5 CEMP CHEMDNER BioNLP Named entity recognition Deep learning LSTM Attention mechanism # Abstract…
…mework for bacterial named entity recognition with domain features # Keywords Named entity recognition Biomedical text mining Conditional random field Deep learning # Abstract ## Background Microbes have been shown to play a crucial role in various ecosystems. Many human diseases have been proved to be associated with bacteria, so it is essential to extract the interaction between bacteria for m…
…e drug events natural language processing deep learning information extraction adverse drug reaction reporting systems named entity recognition relation extraction # Abstract ## Background An adverse drug event (ADE) is commonly defined as “an injury resulting from medical intervention related to a drug.” Providing information related to ADEs and alerting caregivers at the point of care can reduce the risk o…
…hian Dinani, Soudabeh and Millagaha Gedara, Nuwan Indika and Xu, Xuan and Richards, Emily and Maunsell, Fiona and Zad, Nader and Tell, Lisa A. Front Vet Sci, 2021 # Title Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians # Keywords MRL and tolerance commercial rapid assay test machine learning large scale data mining table extract…
…ummaries in popup windows containing knowledge related to the identified terms along with links to various databases. It uses the EXTRACT tagging service to perform named entity recognition (NER) for genes/proteins, chemical compounds, organisms, tissues, environments, diseases, phenotypes and gene ontology terms. Multiple files can be analyzed, whereas identified terms such as proteins or genes can be explored through functional enrichment analysis or be associated with diseases and PubMed entries. Finally, …
performance (5 docs)

Example annotations:

…have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-scores by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total score of 72.98%, which is better than the baseline system that only uses singleton features. ## Conclusion We demonstrate the benefits of using the sequential forward search algorithm to select effective conju…
…cuments per minute, is configurable via XML, and can be plugged into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreat…
… BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. BANNER-CHEMDNER system is available at: . # Body ## Ba…
…l through comparing with nine state-of-the-art baseline methods including cTAKES, MetaMap, DNorm, C-Bi-LSTM-CRF, TaggerOne and DNER. ## Results The results show that the SBLC model achieves an F1 score of 0.862 and outperforms the other methods. In addition, the model does not rely on external domain dictionaries, thus it can be more conveniently applied in many aspects of medical text processing. ## Con…
…mechanism. LSTMVoter outperforms each extractor integrated by it in a series of experiments. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieves an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieves an F1-score of 89.01%. ## Availability and implementation Data and code are available at . # Body ## Intr…
…ioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieves an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieves an F1-score of 89.01%. ## Availability and implementation Data and code are available at . # Body ## Introduction In order to advance the fields of biological, chemical and biomedical research, it is im…
…cognition, which integrates domain features into a deep learning framework combining bidirectional long short-term memory network and convolutional neural network. When domain features are not added, F1-measure of the model achieves 89.14%. After part-of-speech (POS) features and dictionary features are added, F1-measure of the model achieves 89.7%. Hence, our model achieves an advanced performance in bacterial NER with the domain feat…
…rm memory network and convolutional neural network. When domain features are not added, F1-measure of the model achieves 89.14%. After part-of-speech (POS) features and dictionary features are added, F1-measure of the model achieves 89.7%. Hence, our model achieves an advanced performance in bacterial NER with the domain features. ## Conclusions We propose an efficient method for bacterial named entity recognition which combine…
document type (4 docs)

Example annotations:

…7.04% F-measure on the official testing set of the BioCreative II gene mention task. ## Methods Our chemical and drug NER system design is shown in Figure . First, we perform preprocessing on MEDLINE and PMC document collection and then extract two different feature sets, a base feature set and a word representation feature set, in the feature processing phase. The unlabeled set of the collection is fed to unsupervised lear…
… is essential to extract the interaction between bacteria for medical research and application. At the same time, many bacterial interactions with certain experimental evidences have been reported in biomedical literature. Integrating this knowledge into a database or knowledge graph could accelerate the progress of biomedical research. A crucial and necessary step in interaction extraction (IE) is named entity recogn…
…ecords (EHRs) as either coded problems or allergies are often incomplete, leading to underreporting. Therefore, it is important to develop capabilities to process unstructured EHR data in the form of clinical notes, which contain a richer documentation of a patient’s ADE. Several natural language processing (NLP) systems have been proposed to automatically extract information related to ADEs. However, the resul…
…biomedical terms from such files in an automated way is absolutely necessary. In this article, we present OnTheFly , a web application for extracting biomedical entities from individual files such as plain texts, office documents, PDF files or images. OnTheFly can generate informative summaries in popup windows containing knowledge related to the identified terms along with links to various databases. It uses the EXTRACT tagging service to perfo…
processing (pre or post) (2 docs)

Example annotations:

…an be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve the problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features…
…et of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. In this case, we apply numerical normalization, which solves the problem by replacing all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target wo…
…g and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing. ## Results To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selecte…
…labeled data. ### Preprocessing Preprocessing is where text data is cleaned and processed via NLP tasks and is a preparatory task for feature processing. First, the text data is cleansed by removing non-informative characters and replacing special characters with corresponding spellings. The text is then tokenized with a tokenization tool. We evaluated two different tokenization strategies: a simple white space tokenizer …
…Preprocessing is where text data is cleaned and processed via NLP tasks and is a preparatory task for feature processing. First, the text data is cleansed by removing non-informative characters and replacing special characters with corresponding spellings. The text is then tokenized with a tokenization tool. We evaluated two different tokenization strategies: a simple white space tokenizer and the BANNER simple tokenizer. The white space tokenizer spl…
…is a preparatory task for feature processing. First, the text data is cleansed by removing non-informative characters and replacing special characters with corresponding spellings. The text is then tokenized with a tokenization tool. We evaluated two different tokenization strategies: a simple white space tokenizer and the BANNER simple tokenizer. The white space tokenizer splits the text simply, based o…
features (1 doc)

Example annotations:

…enizer splits the text simply, based on blanks within it, whereas the BANNER tokenizer breaks tokens into either a contiguous block of letters and/or digits or a single punctuation mark. Finally, the lemma and the part-of-speech (POS) information were obtained for a further usage in the feature extraction phase. In BANNER-CHEMDNER, BioLemmatizer [ ] was used for lemma extraction, which resulted in a si…
…the text simply, based on blanks within it, whereas the BANNER tokenizer breaks tokens into either a contiguous block of letters and/or digits or a single punctuation mark. Finally, the lemma and the part-of-speech (POS) information were obtained for a further usage in the feature extraction phase. In BANNER-CHEMDNER, BioLemmatizer [ ] was used for lemma extraction, which resulted in a significant improvement in overall system p…
location (1 doc)

Example annotations:

…for drug names or other fields were observed while extracting data tables. ### Desired Information From Structured or Unstructured Documents Below we review multiple cases to extract data from tables. For these cases, it is required to check if the keywords determined important in the real-time data collection via PDF and webpage parsing are clearly characterized in the extracted tables provided …
…g beta-lactams, tetracyclines, aminoglycosides, and sulfonamides from the website ( ). Therefore, using our trained model based on the Python packages of requests and BeautifulSoup , all the rapid assay URL links for dairy tests are parsed and automatically examined for potentially available tables on each page. Below we presented an example of adaptable parsing of real-time data extracting. When the query pin…
exclude (1 doc)

Example annotations:

Xing, Yuting and Wu, Chengkun and Yang, Xi and Wang, Wei and Zhu, En and Yin, Jianping Molecules, 2018 # Title ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers # Keywords biomedical text mining big data Tianhe-2 parallel computing load balancing # Abstract A prevailing way o…
interesting (1 doc)

Example annotations:

…ications are helpful in finding the gene, disease, chemical, drugs, protein entities. Finding entities relation such as gene–gene entities, drug-disease interaction, and chemical protein relation the PubExN can be helpful for these types of biomedical applications. In most cases, domain experts do this extraction process on their own. Human interference makes this process time-consuming and there is a h…
note (0 docs)

(No annotations with this label in the current project)
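The per-label document counts in this section, such as "NER method (9 docs)", could be recomputed from exported annotations with a sketch like the one below. It assumes each annotated document is a dict with an "annotations" list whose items carry a "label_name" key. That layout is an assumption about the export format, and the function names are hypothetical.

```python
from collections import Counter


def docs_per_label(annotated_docs, known_labels=()):
    """Count how many documents carry each label at least once."""
    # Start known labels at zero so unused ones (like "note") still appear.
    counts = Counter({label: 0 for label in known_labels})
    for doc in annotated_docs:
        # A set ensures a label is counted once per document,
        # however many spans it has in that document.
        labels_in_doc = {a["label_name"] for a in doc.get("annotations", [])}
        counts.update(labels_in_doc)
    return counts


def format_counts(counts):
    """Return lines such as 'entity (7 docs)', most frequent label first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{label} ({n} {'doc' if n == 1 else 'docs'})" for label, n in ordered]
```

Counting documents rather than spans matches the headings above, where a label with many highlighted passages in one paper still contributes a single document.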