Welcome to Consultramy's Semantic Web Blog

Let me know what you think, and let's get social. Feel free to connect with me on Twitter at http://www.twitter.com/consultramy

Showing posts with label text mining.

Wednesday, June 1, 2011

New Products Announced at SemTech 2011

New Products Announced at SemTech 2011 http://www.prweb.com/releases/2011/6/prweb8497129.htm
Leading industry companies will unveil and debut the newest products paving the way in semantic technology.

New York, New York (PRWEB) June 01, 2011
Mediabistro.com (a division of WebMediaBrands Inc., Nasdaq: WEBM) today announced new product releases that will be revealed at the Semantic Technology Conference (#SemTech), the world’s largest conference on the commercialization of semantic technologies, taking place June 5-9, 2011 at the San Francisco Hilton in Union Square.
SemTech is the preferred industry platform for exhibitors to announce product launches and breaking news. Attendees will have the rare opportunity to view products and services from top industry insiders including the following:
Revelytix will discuss several products, including: Spyder - a Relational to RDF conversion tool, Spinner - a SPARQL federation tool, and Rex - a RIF rules engine. Together these tools transform the information management capabilities of any enterprise.
Oracle will show how semantic tools within Oracle Database can effectively store, manage, inference and query RDF/OWL data for enterprise applications.
Ontotext will present the Web Mining Framework (WMF), which involves a process of focused web crawling, screen scraping, text mining, normalization, data merging, and de-duplication, resulting in normalized, structured data.
Inform Technologies will launch the Inform AdContext Service, which uses semantic metadata to fine-tune ad selection and make ads more topically relevant to the content.
Clark & Parsia will show how Pellet 3 (a leading OWL 2 reasoner) and Stardog (the new, world-class RDF database featuring fast SPARQL query performance) can be used to build fast and scalable semantic applications for the enterprise.
Cambridge Semantics will demonstrate how non-technical business users can combine Microsoft Excel and Anzo ETL to intuitively create mappings from an existing clinical database to the industry standard SDTM ontology and do live analysis.
Cray, Inc. will launch the Cray XMT System, designed specifically to run challenging big data graph analytics workloads that bring traditional systems to their knees.
Pragmatech will debut CTRL, a semantic engine that goes beyond words into concepts, which are then composed into topics that are subsequently analyzed to identify the ‘key’ topics that describe what a certain document is about.
ai-one will debut the Topic-Mapper SDK for text, enabling creation of intelligent applications that deliver better capabilities for semantic discovery, lightweight ontologies, knowledge collaboration, sentiment analysis, AI and data mining.
Talis will present Kasabi, a new web application that aims to support organisations in the publishing and monetization of data on the web.
Protégé will provide updated information on the latest enhancements to the tool and a description of WebProtégé, the web-based version that provides lightweight ontology editing directly in your ontology browser.
Knowledge Hives will introduce Civet, which uses NLP techniques to identify and analyze keywords in text; map them to concepts from vocabularies such as WordNet; and deliver an RDFa document with key words, phrases and names referencing Linked Data concepts.
Semantrix will debut its SM3 Social Multimedia Metadata Manager, which delivers enhanced content value through metadata extraction, annotation and cross-referencing using NLP, Search, Ontology and proprietary concept extraction.
MIT and Zepheira will show the latest work on Exhibit 3.0, fixing many shortcomings of the original, popular tool from the MIT Simile Project, making it far more scalable, modular, and feature rich.
To register for the conference, request a press pass, or to view the program schedule, visit http://semtech2011.semanticweb.com
2011 SemTech sponsors, leading vendors, and developers will demonstrate dozens of innovations at the SemTech Expo Hall.
They include Ontotext, Oracle, Revelytix, Elsevier, Fluid Operations, iQser, Ontoprise, OpenAmplify, OpenText, Orbis, TopQuadrant, XSB, Cognition, Morgan Kaufmann, Expert System, Tom Sawyer Software, Semantic Valley, Semantifi, Liaison Technologies, DERI, Franz, Semantic Arts, Semsphere, and more.
For sponsorship and exhibit information, contact:
Frank Fazio
Senior Director of Sales, Events
203-662-2887
eventsales(at)webmediabrands(dot)com
About WebMediaBrands Inc.
WebMediaBrands Inc. (Nasdaq: WEBM) (http://www.webmediabrands.com), headquartered in New York, NY, is a leading Internet media company that provides content, education, and career services to media and creative professionals through a portfolio of vertical online properties, communities, and trade shows. The Company's online business includes: (i) mediabistro.com, a leading blog network providing content, education, community, and career resources (including the industry's leading online job board) about major media industry verticals including new media, social media, Facebook, TV news, sports news, advertising, public relations, publishing, design, mobile, and the Semantic Web; and (ii) AllCreativeWorld.com, a leading network of online properties providing content, education, community, career, and other resources for creative and design professionals. The Company's online business also includes community, membership and e-commerce offerings including a freelance listing service, a marketplace for designing and purchasing logos and premium membership services. The Company's trade show and educational offerings include conferences, online and in-person courses, and video subscription libraries on topics covered by the Company's online business.
All WebMediaBrands press releases are here:
http://www.webmediabrands.com/corporate/press.html
For information about WebMediaBrands contact:
Amanda Barrett
Director of Marketing
212-547-7879
press(at)webmediabrands(dot)com

Tuesday, March 8, 2011

Is Mobile BI Worth the Hype?

By Ramy Ghaly

 

If yours is like the average business, chances are your need for mobile analytics is, frankly, limited. Most companies are toward the wide end of the funnel shown in the graphic below, with many employees using mobile devices for mundane matters like email access (the classic "BlackBerry use case"); perhaps a few of them benefit from high-end functions like alerts, and even fewer from interactive mobile analytics. 

For users to benefit from mobile BI, they must be able to navigate dashboards and guided analytics comfortably -- or as comfortably as the mobile device will allow, which is where devices with high-resolution screens and touch interfaces (like the iPhone and Android-based phones) have a clear edge over, say, earlier editions of BlackBerry.

The next time your BI vendor touts its mobile business intelligence capabilities, invite it to present use cases that demonstrate those capabilities as implemented at other customers.

 

Read More: "InformationWeek" Via ctrl-News

Thursday, March 3, 2011

HP To Acquire Analytics Specialist Vertica

By Ramy Ghaly 

The buyout will help HP counter IBM's recent acquisition of Netezza as the analytics sector heats up.


Hewlett-Packard said Monday it agreed to acquire Vertica, a privately-held developer of software that lets businesses analyze and interpret information stored in enterprise databases. The move should help HP keep pace with rival IBM, which recently bolstered its analytics portfolio with the buyout of Netezza. 

HP officials said the deal will help enterprise customers cope with vastly increasing amounts of information coming into their organizations—through the Web, mobile phones, smart devices, and other sources.

IBM enhanced its analytics portfolio late last year with the $1.7 billion acquisition of Netezza, which bundles analytics software with specialized hardware.

HP said it expects the deal to close in the second quarter.

Read More "Information Week" Via ctrl-News

Wednesday, January 12, 2011

2011 Digital Trends #6 - Data and artificial intelligence "algorithm"

January 2011

This is part of a series of posts giving my take on 11 digital trends for 2011. This sixth trend is about data and AI. The trends are not listed in order of importance.

In the January 2011 edition of Wired Magazine, the feature The AI Revolution Is On says “By using probability-based algorithms to derive meaning from huge amounts of data, researchers discovered that they didn’t need to teach a computer how to accomplish a task; they could just show it what people did and let the machine figure out how to emulate that behavior under similar circumstances.”

On the web, artificial intelligence is already part of many websites we visit daily: Google's search, Facebook's friend recommendations and top news feed, Amazon's recommendation engine, and Pandora's and Last.fm's recommendation engines. These AI systems (some call them algorithms, which I think isn't quite accurate) derive their "intelligence" from the massive amounts of data generated by their huge user bases, rather than from mimicking human intelligence.

AI is not analytics like Google Analytics or Facebook Insights. The goal of analytics is to offer you data for analysis and conclusions, while the goal of AI is to offer your users something actionable or tangible, like a recommendation.

How is this relevant to marketing? Recommendation and personalization engines are the first that come to mind. However, building one is clearly very costly and difficult, and you may not have a huge data set to begin with. I think the way to go is to rely on third-party services via APIs, as discussed in Trend #5.

The Facebook Recommendations plugin is one of the first to offer such a service. It uses data from the user's interactions with the target website and from the user's friends' interactions. I believe it takes the user's Social Graph into account as well.

Google has a whole library of APIs for its various services, and a few have AI features. Use them at your own risk, though. Surprisingly, I find Google APIs quite unstable because there are so many and they seem to be created by different teams (they are, actually). If you are using Google AdSense, the AdSense API allows you to generate code that displays contextual ads. The YouTube Data API has some form of recommendation feature as well that you can query for “related videos”, and the Google Language API deals with translation.

My 2011 Watch List for data and AI:

  • Facebook Social Graph – I hope that the next step is to provide a set of APIs that give third parties AI capabilities such as recommendations based on the data from Facebook itself and the third party sites.
  • Foursquare - I am interested to see how Foursquare’s location data can be used for recommendations and personalization beyond the “Trending Now” feature. Or does it need to be integrated with Facebook Social Graph?

Via: http://blog.campaignasia.com/yonghwee/2011-digital-trends-6-data-and-artificial-intelligence-algorithm/

Mining Wikipedia with Hadoop and Pig for Natural Language Processing


The context: semantic knowledge extraction from unstructured text

In a previous post we introduced fise, an open source semantic engine now being incubated at the ASF under the new name Apache Stanbol. A 4-minute demo shows how such a semantic engine can be used by a document management system such as Nuxeo DM to tag documents with entities instead of potentially ambiguous words.
The problem with the current implementation, which is based on OpenNLP, is the lack of readily available statistical models for Named Entity Recognition in languages such as French. Furthermore, the existing models are restricted to the detection of a few entity classes (right now the English models can detect person, place, and organization names).
To build such a model, developers have to teach or train the system by applying a machine learning algorithm to an annotated corpus of data. It is very easy to write such a corpus for OpenNLP: just pass it a text file with one sentence per line, where entity occurrences are marked using START and END tags, for instance:
<START:person> Olivier Grisel <END> is working on the <START:software> Stanbol <END> project .
The only slight problem is to somehow convince someone to spend hours manually annotating hundreds of thousands of sentences from text on various topics such as business, famous people, sports, science, literature, history... without making too many mistakes.
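For reference, training a model from such an annotated file with the OpenNLP 1.5 Java API looks roughly like the sketch below. The file names are placeholders, and the exact train(...) overloads vary slightly between OpenNLP versions; the equivalent command line invocation appears later in this post.

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainNameFinder {
    public static void main(String[] args) throws Exception {
        // annotated-sentences.txt is a placeholder: one <START:type> ... <END>
        // annotated sentence per line, as in the example above.
        ObjectStream<String> lines = new PlainTextByLineStream(
            new InputStreamReader(new FileInputStream("annotated-sentences.txt"), "UTF-8"));
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Train a maxent name finder: language code, entity type, samples,
        // extra resources, number of iterations, feature cutoff.
        TokenNameFinderModel model = NameFinderME.train(
            "en", "person", samples, Collections.<String, Object>emptyMap(), 70, 5);
        samples.close();

        // Serialize the trained model to disk for later use.
        OutputStream out = new BufferedOutputStream(new FileOutputStream("en-ner-person.bin"));
        model.serialize(out);
        out.close();
    }
}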

Mining Wikipedia in the cloud

Instead of manually annotating text, one can try to benefit from an existing, publicly available annotated text corpus that deals with a wide range of topics, namely Wikipedia.
Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those linked articles describe entities of the classes we are interested in (e.g. people, countries, cities). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the Wikimarkup formatting syntax).
To find the type of the entity described by a given Wikipedia article, one can use the category information as described in this paper by Alexander E. Richman and Patrick Schone. Alternatively, we can use the semi-structured information available in the Wikipedia infoboxes. We decided to go for the latter by reusing the work done by the DBpedia project:
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopedia itself.
More specifically we will use a subset of the DBpedia RDF dumps:
  • instance_types_en.nt to relate a DBpedia entity ID to its entity class ID
  • wikipedia_links_en.nt to relate a Wikipedia article URL to the corresponding DBpedia entity ID
The mapping from a Wikipedia URL to a DBpedia entity ID is also available in 12 languages (en, de, fr, pl, ja, it, nl, es, pt, ru, sv, zh) which should allow us to reuse the same program to build statistical models for all of them.
Hence to summarize, we want a program that will:
  1. parse the Wikimarkup of a Wikipedia dump to extract unformatted text body along with the internal wikilink position information;
  2. for each link target, find the DBpedia ID of the entity if available (this is the equivalent of a JOIN operation in SQL);
  3. for each DBpedia entity ID find the entity class ID (this is another JOIN);
  4. convert the result to OpenNLP formatted files with entity class information.
In order to implement this, we started a new project called pignlproc, licensed under ASL2. The source code is available as a github repository. pignlproc uses Apache Hadoop for distributed processing, Apache Pig for high-level Hadoop scripting, and Apache Whirr to deploy and manage Hadoop on a cluster of tens of virtual machines on the Amazon EC2 cloud infrastructure (you can also run it locally on a single machine, of course).
Detailed instructions on how to run this yourself are available in the README.md and the online wiki.

Parsing the wikimarkup

The following script performs the first step of the program, namely parsing and cleaning up the wikimarkup and extracting the sentences and link positions. It uses some pignlproc-specific User Defined Functions written in Java to parse the XML dump, parse the Wikimarkup syntax using the bliki wiki parser, and detect sentence boundaries using OpenNLP, all while propagating the link position information.
-- Register the project jar to use the custom loaders and UDFs
REGISTER $PIGNLPROC_JAR

parsed = LOAD '$INPUT'
  USING pignlproc.storage.ParsingWikipediaLoader('$LANG')
  AS (title, wikiuri, text, redirect, links, headers, paragraphs);

-- filter and project as early as possible
noredirect = FILTER parsed by redirect IS NULL;
projected = FOREACH noredirect GENERATE title, text, links, paragraphs;

-- Extract the sentence contexts of the links respecting the paragraph
-- boundaries
sentences = FOREACH projected
  GENERATE title, flatten(pignlproc.evaluation.SentencesWithLink(
    text, links, paragraphs));

stored = FOREACH sentences
  GENERATE title, sentenceOrder, linkTarget, linkBegin, linkEnd, sentence;

-- Ensure ordering for fast merge with type info later
ordered = ORDER stored BY linkTarget ASC, title ASC, sentenceOrder ASC;

STORE ordered INTO '$OUTPUT/$LANG/sentences_with_links';
We store the intermediate results on HDFS for later reuse by the last script in step 4.

Extracting entity class information from DBpedia

The second script performs steps 2 and 3 (the joins) on the DBpedia dumps. It also uses some pignlproc-specific tools to quickly parse NT triples while filtering out those that are not interesting:
-- Load wikipedia links and instance types from the DBpedia dumps
wikipedia_links = LOAD '$INPUT/wikipedia_links_$LANG.nt'
  USING pignlproc.storage.UriUriNTriplesLoader(
    'http://xmlns.com/foaf/0.1/primaryTopic')
  AS (wikiuri: chararray, dburi: chararray);

wikipedia_links2 =
  FILTER wikipedia_links BY wikiuri IS NOT NULL;

-- Load DBpedia type data and filter out the overly generic owl:Thing type
instance_types =
  LOAD '$INPUT/instance_types_en.nt'
  USING pignlproc.storage.UriUriNTriplesLoader(
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
  AS (dburi: chararray, type: chararray);

instance_types_no_thing =
  FILTER instance_types BY type NEQ 'http://www.w3.org/2002/07/owl#Thing';

joined = JOIN instance_types_no_thing BY dburi, wikipedia_links2 BY dburi;
projected = FOREACH joined GENERATE wikiuri, type;

-- Ensure ordering for fast merge with sentence links
ordered = ORDER projected BY wikiuri ASC, type ASC;

STORE ordered INTO '$OUTPUT/$LANG/wikiuri_to_types';
Again we store the intermediate results on HDFS for later reuse by other scripts.

Merging and converting to the OpenNLP annotation format

Finally, the last script takes as input the previously generated files and an additional mapping from DBpedia class names to their OpenNLP counterparts, for instance:
http://dbpedia.org/ontology/Person          person
http://dbpedia.org/ontology/Place           location
http://dbpedia.org/ontology/Organisation    organization
http://dbpedia.org/ontology/Album           album
http://dbpedia.org/ontology/Film            movie
http://dbpedia.org/ontology/Book            book
http://dbpedia.org/ontology/Software        software
http://dbpedia.org/ontology/Drug            drug
The Pig script that performs the final joins and the conversion to the OpenNLP output format follows. Here again pignlproc provides some UDFs for converting the Pig tuple and bag representation to the serialized format accepted by OpenNLP:
SET default_parallel 40

REGISTER $PIGNLPROC_JAR

-- use the english tokenizer for other European languages as well
DEFINE opennlp_merge pignlproc.evaluation.MergeAsOpenNLPAnnotatedText('en');

sentences = LOAD '$INPUT/$LANG/sentences_with_links'
  AS (title: chararray, sentenceOrder: int, linkTarget: chararray,
      linkBegin: int, linkEnd: int, sentence: chararray);

wikiuri_types = LOAD '$INPUT/$LANG/wikiuri_to_types'
  AS (wikiuri: chararray, typeuri: chararray);

-- load the type mapping from DBpedia type URI to OpenNLP type name
type_names = LOAD '$TYPE_NAMES' AS (typeuri: chararray, typename: chararray);

-- Perform successive joins to find the OpenNLP typename of the linkTarget
joined = JOIN wikiuri_types BY typeuri, type_names BY typeuri USING 'replicated';
joined_projected = FOREACH joined GENERATE wikiuri, typename;

joined2 = JOIN joined_projected BY wikiuri, sentences BY linkTarget;
result = FOREACH joined2 GENERATE title, sentenceOrder, typename,
  linkBegin, linkEnd, sentence;

-- Reorder and group by article title and sentence order
ordered = ORDER result BY title ASC, sentenceOrder ASC;
grouped = GROUP ordered BY (title, sentenceOrder);

-- Convert to the OpenNLP training format
opennlp_corpus =
  FOREACH grouped
  GENERATE opennlp_merge(
    ordered.sentence, ordered.linkBegin, ordered.linkEnd, ordered.typename);

STORE opennlp_corpus INTO '$OUTPUT/$LANG/opennlp';
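To make the role of the opennlp_merge UDF more concrete, here is a simplified stand-alone Java sketch of the conversion it performs. This is not the actual pignlproc implementation; in particular, it assumes that linkBegin and linkEnd are token offsets within the sentence, with the end offset exclusive.

/**
 * Simplified illustration (not the actual pignlproc UDF) of merging a
 * tokenized sentence with an entity span and type into the OpenNLP
 * <START:type> ... <END> annotation format.
 */
public class OpenNlpAnnotationMerger {

    public static String merge(String[] tokens, int linkBegin, int linkEnd, String typeName) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i == linkBegin) {
                sb.append("<START:").append(typeName).append("> ");
            }
            sb.append(tokens[i]);
            if (i == linkEnd - 1) {
                sb.append(" <END>");
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Olivier", "Grisel", "is", "working", "on", "the", "Stanbol", "project", "."};
        // Annotate "Olivier Grisel" as a person (tokens 0..2, end exclusive).
        System.out.println(merge(tokens, 0, 2, "person"));
        // -> <START:person> Olivier Grisel <END> is working on the Stanbol project .
    }
}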
Depending on the size of the corpus and the number of nodes you are using, each individual job will run from a couple of minutes to a couple of hours. For instance, the first step of parsing 3GB of Wikipedia XML chunks on 30 small EC2 instances will typically take between 5 and 10 minutes.

Some preliminary results

Here is a sample of the output on the French Wikipedia dump for location detection only:
You can replace "location" by "person" or "organization" in the previous URL for more examples. You can also replace "part-r-00000" by "part-r-000XX" to download larger chunks of the corpus. You can also replace "fr" by "en" to get English sentences.
By concatenating chunks of each corpus into files of ~100k lines, one can get reasonably sized input files for the OpenNLP command line tool:
$ opennlp TokenNameFinderTrainer -lang fr -encoding utf-8 \
    -iterations 50 -type location -model fr-ner-location.bin \
    -data ~/data/fr/opennlp_location/train
Here are the resulting models:
It is possible to retrain those models on a larger subset of chunks by allocating more than 2GB of heap-space to the OpenNLP CLI tool (I used version 1.5.0). To evaluate the performance of the trained models you can run the OpenNLP evaluator on a separate part of the corpus (commonly called the testing set):
$ opennlp TokenNameFinderEvaluator -encoding utf-8 \
    -model fr-ner-location \
    -data ~/data/fr/opennlp_location/test
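Beyond the command line tools, a trained model such as fr-ner-location.bin can also be loaded directly from Java. Here is a minimal sketch assuming the OpenNLP 1.5 API (the sample sentence is an invented example, pre-tokenized on whitespace for simplicity):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class FindLocations {
    public static void main(String[] args) throws Exception {
        // Load the serialized model produced by the trainer.
        InputStream in = new FileInputStream("fr-ner-location.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(in);
        in.close();

        NameFinderME finder = new NameFinderME(model);
        // Pre-tokenized sample sentence (whitespace tokenization for simplicity).
        String[] tokens = "Le Louvre est un musée situé à Paris en France .".split(" ");

        // Each Span carries token-level start/end offsets plus the entity type.
        for (Span span : finder.find(tokens)) {
            StringBuilder name = new StringBuilder();
            for (int i = span.getStart(); i < span.getEnd(); i++) {
                name.append(tokens[i]).append(' ');
            }
            System.out.println(span.getType() + ": " + name.toString().trim());
        }
        // Clear document-level adaptive data before processing another document.
        finder.clearAdaptiveData();
    }
}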
The corpus is quite noisy, so the performance of the trained models is not optimal (but better than nothing anyway). Here are the results of evaluations on held-out chunks of each corpus (+/- 0.02):
Performance evaluation for NER on a French extraction with 100k sentences
class          precision   recall   f1-score
location       0.87        0.74     0.80
person         0.80        0.68     0.74
organization   0.80        0.65     0.72
Performance evaluation for NER on an English extraction with 100k sentences
class          precision   recall   f1-score
location       0.77        0.67     0.71
person         0.80        0.70     0.75
organization   0.79        0.64     0.70

The results of this first experiment are interesting, but lower than the state of the art, especially for the recall values. The main reason is that many sentences in Wikipedia contain entity mentions that do not carry a link.
A potential way to improve this would be to set up some active learning tooling: the trained models would suggest missing annotations to a human validator, who could quickly accept or reject them. This would improve the quality of the corpus, and hence the quality of the next generation of models, until the corpus approaches the quality of a fully manually annotated one.

Future work

I hope this first experiment convinces some of you of the power of combining tools such as Pig, Hadoop and OpenNLP for batch text analytics. After advertising the project on the OpenNLP users mailing list, we have already received some very positive feedback. It is very likely that pignlproc will be contributed one way or another to the OpenNLP project.
These tools are, of course, not limited to training OpenNLP models; it would also be very easy to adapt the code of the conversion UDF to generate BIO-formatted corpora for use with other NLP libraries such as NLTK.
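As a simplified illustration of that conversion (this is not part of pignlproc), the following Java sketch turns a sentence in the OpenNLP <START:type> ... <END> format into BIO-tagged token/label pairs:

import java.util.ArrayList;
import java.util.List;

/**
 * Simplified illustration of converting a sentence in the OpenNLP
 * <START:type> ... <END> format into BIO-tagged token/label pairs
 * usable by other NLP toolkits.
 */
public class OpenNlpToBio {

    public static List<String[]> toBio(String annotatedSentence) {
        List<String[]> out = new ArrayList<String[]>();
        String currentType = null;   // null means we are outside any entity
        boolean first = false;       // true for the first token of an entity
        for (String token : annotatedSentence.trim().split("\\s+")) {
            if (token.startsWith("<START:") && token.endsWith(">")) {
                currentType = token.substring("<START:".length(), token.length() - 1);
                first = true;
            } else if (token.equals("<END>")) {
                currentType = null;
            } else if (currentType != null) {
                out.add(new String[] {token, (first ? "B-" : "I-") + currentType.toUpperCase()});
                first = false;
            } else {
                out.add(new String[] {token, "O"});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "<START:person> Olivier Grisel <END> is working on the "
            + "<START:software> Stanbol <END> project .";
        for (String[] pair : toBio(sentence)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}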
Finally, there is no reason to limit this processing to NER corpora generation. Similar UDFs and scripts could be written to identify sentences that express, in natural language, the relationships between entities that have already been extracted in structured form from the infoboxes by the DBpedia project.
Such a corpus would be of great value for developing and evaluating automated extraction of entity relationships and properties. Such an extractor could potentially be based on syntactic parsers such as the ones available in OpenNLP or MaltParser.

Credits

This work was funded by the Scribo and IKS R&D projects. We also would like to thank all the developers of the involved projects.