Welcome to Consultramy's Semantic Web Blog

Let me know what you think, and let's get social with this. You are welcome to connect through Twitter at http://www.twitter.com/consultramy

Monday, January 31, 2011

Does Social Media Set a “New World Order” in Building Revolutions throughout the Middle East?

BY consultramy

The uprising started in Tunisia with calls for a long-ruling dictator to step down. Egypt is now following suit after Tunisia's Jasmine Revolution, seeking reform, freedom, and respect. Protesters are using social media to organize, communicate, and express their feelings not only to the world but also to their dictators: "This is the end of your rule, and now is our time to make the change, whether you like it or not."
The young catalysts of Twitter, Facebook, and other social media are the driving force behind the revolution, not only in Tunisia and Egypt but also in Jordan and Yemen, where social media has established a new world order in digital social communications that can bring down totalitarian regimes all over the Middle East. An unstoppable young generation, the educated elite of the Arab world, is speaking out and seeking freedom. Yes, freedom is a precious commodity, especially in the Arab world, where no real democracies exist, only decades of totalitarian regimes that took power from their people and turned them into modern slaves. Moreover, the blocked elite of Egypt are speaking out after thirty years of selfish rule that stole their dignity by keeping them from engaging in decision making and taking responsibility for their future. They are now fighting for their pride by setting the road map to gain freedom, the right to speak out, and, needless to say, respect, rewriting history in a way that makes them proud citizens of the change they forced and sets an example not only for Egypt but for other dictators in the Middle East and the world. Social media has now spread fear among dictatorships throughout the Arab world. Egypt's regime shut down all social networks and communications to pressure the youth into stopping their uprising, but the move backfired: it sparked more anger and gave the youth more reason not to give up easily and to carry the fight for freedom to the end, while Mr. Mubarak witnesses the last moments of his 30-year rule.
Going social in the digital world is building a new world order by driving change and bringing down dictators, at least in the Arab world. The young blocked elite of Egypt are seeking change, from regime reform, free elections, and the ability to choose their own leaders to the basic civil right to be heard and respected. Achieving it is even more challenging when chaos takes over in the streets and sets a whole new ball game. Social media is playing a vital role in making change possible, giving these young rebels an unprecedented tool with the ability to bring down dictators and set a new world order in social digital communications.
Important facts about Egypt, according to the CIA World Factbook's latest figures (2010):
Population: 80 Million
Median age: 24 years
GDP: Approximately $216.8 billion (2009)
Per capita income: $6,200 (GDP/year-PPP 2010)
Unemployment: 9.7% (est.)
Poverty: 40%
Strategic interests to the world:
  • Israeli/Egyptian Peace Treaty signed in 1979.
  • It is estimated that 10 percent of the global crude oil demand passes through the Suez Canal.
  • The biggest population in the Middle East
  • Egypt receives nearly $2-$3 billion in aid per year from the United States.
  • Egypt holds ancient treasures and artifacts.
  • The Pyramids are one of the top wonders of the world.
  • A global destination for tourists making it one of the top tourist markets in the world.
Popular Hash Tags used for the "uprising in Egypt" on Twitter:
#liberation technology
#social freedom
#twitter revolution
#SM revolution
#civil liberty
#Mona Altahawy
Latest Sentiment analysis of Egypt's uprising since it started six days ago on January 25, 2011
About the Author:
Ramy Ghaly is a Marketing Strategist with more than ten years of experience in international markets. He has held professional and managerial positions in multiple global markets and in various industries, ranging from retail, wholesale, and consumer goods to technology product management, with a concentration in channel development. In addition, he holds a degree in International Marketing Management with a minor in International Relations and Middle Eastern Studies from Daytona State College. He is interested in social media developments, next-generation search technologies, semantic search engines, and text analytics; needless to say, geopolitical strategy, Middle Eastern studies, and the environmental factors that affect global business growth are general interests that he is keen to monitor and write about.

Friday, January 28, 2011

What are the most challenging issues in Sentiment Analysis (opinion mining)?

Ramy Ghaly January 28, 2011

Hossein Said:

Opinion Mining/Sentiment Analysis is a somewhat recent subtask of Natural Language Processing. Some compare it to text classification; some take a deeper stance towards it. What do you think are the most challenging issues in Sentiment Analysis (opinion mining)? Can you name a few?


Hightechrider Said:

The key challenges for sentiment analysis are:-

1) Named Entity Recognition - What is the person actually talking about, e.g. is 300 Spartans a group of Greeks or a movie?

2) Anaphora Resolution - the problem of resolving what a pronoun, or a noun phrase refers to. "We watched the movie and went to dinner; it was awful." What does "It" refer to?

3) Parsing - What is the subject and object of the sentence, which one does the verb and/or adjective actually refer to?

4) Sarcasm - If you don't know the author you have no idea whether 'bad' means bad or good.

5) Twitter - abbreviations, lack of capitals, poor spelling, poor punctuation, poor grammar, ...
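
Two of the points above (sarcasm and Twitter-style spelling) can be illustrated with a deliberately naive word-counting scorer. This is only a sketch to show where such an approach breaks down, not a description of how any real sentiment system works; the tiny lexicon, the function name `naive_sentiment`, and the second and third example sentences are made up for illustration.

```python
import re

# Tiny hand-made lexicon (an assumption for this sketch); real systems
# use far larger resources and trained models.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def naive_sentiment(text: str) -> int:
    """Sum lexicon scores over lowercase alphabetic tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)


examples = [
    # Scored as negative, but the scorer cannot say whether "it" is the
    # movie or the dinner (anaphora resolution).
    "We watched the movie and went to dinner; it was awful.",
    # Sarcasm: the words are positive, the opinion is not.
    "Oh great, another two-hour delay. I love waiting.",
    # Twitter-style spelling: none of the words are in the lexicon.
    "gr8 movie luv it",
]

for text in examples:
    print(naive_sentiment(text), text)
```

The sarcastic sentence comes out as the most positive of the three, and the abbreviated tweet gets no score at all, which is the kind of failure the list describes.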


ealdent Said:

I agree with Hightechrider that those are areas where Sentiment Analysis accuracy can see improvement. I would also add that sentiment analysis tends to be done on closed-domain text for the most part. Attempts to do it on open-domain text usually wind up having very bad accuracy/F1 measure/what have you, or else they are pseudo-open-domain because they only look at certain grammatical constructions. So I would say topic-sensitive sentiment analysis that can identify context and make decisions based on that is an exciting area for research (and industry products).

I'd also expand his 5th point from Twitter to other social media sites (e.g. Facebook, Youtube), where short, ungrammatical utterances are commonplace.


Skarab Said:

I think the answer is language complexity, and mistakes in grammar and spelling. There is a vast number of ways people express their opinions; e.g., sarcasm could be wrongly interpreted as extremely positive sentiment.


What do you think? Do you agree? Would you like to ask a question and get an answer? Try out: Q&A for professional and enthusiast programmers 



Friday, January 21, 2011

Social commerce to surge

NEW YORK: Social commerce sales will rise dramatically during the next five years, encouraging brands and retailers to enhance their presence on sites like Facebook, Booz & Co has argued.

In a new report, the consultancy stated marketers must shift the terms of engagement with consumers using Web 2.0 properties from "like" to "buy".

"The market for social commerce has been embryonic to date, but that will change over the next five years as companies race to establish stores," it said.

"Trendsetting companies are focused on products and services that benefit from the unique characteristics of social media, including the opportunity to get quick feedback from multiple friends and family members."

The study praised 1-800-Flowers, which boasts a fully-functioning Facebook store allowing customers to complete purchases without leaving the network's pages.

It has also implemented other innovative strategies, for example linking Facebook's calendar and "group gifting" features to its Mother's Day campaign.

"We are going to continue to invest in certain areas to help drive future growth," Bill Shea, 1-800-Flowers' chief financial officer, said in late 2010.

"Whether it be franchising efforts for both the consumer floral and our food group, investments in mobile and social commerce [or] floral supply chain in Celebrations.com, we are going to continue to invest."

Dell was cited as another pioneering early-adopter, having earned millions of dollars in revenue through Twitter.

The IT specialist is becoming increasingly active in the smartphone and tablet segments, which the organisation believes will transform the retail sector.

"It used to be 'We're going to tell you how you're going to experience our store,'" said Brian Slaughter, Dell's director, end-user solutions, large enterprises.

"Now the consumer is walking in and saying: 'No, I'm going to tell you how I'm going to use your store to give me more information.' The tools they have at their disposal are very cool."

Similarly, Quidsi, owned by Amazon, recently set up Facebook outlets for its Soap.com and Diapers.com platforms, although the ability to make purchases is limited to members of these two portals.

"No one has yet cracked the nut on Facebook e-commerce," said Josh Himwich, Quidsi's vp, ecommerce.

Overall, Booz estimated sales of physical goods via social channels would hit $5bn (€3.7bn; £3.1bn) globally in 2011, with the US contributing 20% of this total.

Revenues were pegged to reach $9bn by the close of 2012, incorporating $3bn generated by American internet users.

Such figures should achieve $14bn and $5bn respectively for 2013, while US customers deliver nearly half of the $20bn returns yielded in 2014.

By 2015, the worldwide expenditure attributable to this medium is anticipated to come in at $30bn, housing $14bn from the world's biggest economy.
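
To put the forecast figures above side by side, the short sketch below tabulates them and derives the implied compound annual growth rate for global sales. The numbers are the ones quoted in the article; the 2011 US figure is derived here from the stated 20% share, the 2014 US figure is left empty because the article only says "nearly half", and the helper name `cagr` is simply a label chosen for this example.

```python
# Figures as quoted in the article, in billions of US dollars:
# year -> (global sales, US sales).
forecast = {
    2011: (5.0, 1.0),    # US value derived from the stated 20% share
    2012: (9.0, 3.0),
    2013: (14.0, 5.0),
    2014: (20.0, None),  # article says only "nearly half"
    2015: (30.0, 14.0),
}


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


global_cagr = cagr(forecast[2011][0], forecast[2015][0], 2015 - 2011)
print(f"Implied global CAGR 2011-2015: {global_cagr:.1%}")

for year, (total, us) in forecast.items():
    share = "n/a" if us is None else f"{us / total:.0%}"
    print(f"{year}: global ${total:.0f}bn, US share {share}")
```

On these figures the US share of the total rises from a fifth in 2011 to a little under half in 2015.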

A previous Booz survey of netizens dedicating one hour or more a month to social networks, and who bought at least one product online in the last year, found 20% proved willing to pay for items through these sites.

Elsewhere, 10% suggested this spending would be incremental to their current outlay, but 71% added "liking" a brand on Facebook did not improve the probability of buying it.

The consultancy predicted that social media will have the greatest impact on consideration, conversion, loyalty and customer service.

Facebook's chief executive Mark Zuckerberg certainly supported such an optimistic reading when rolling out the Places geo-location system last year.

"If I had to guess, social commerce is the next area to really blow up," he said in August. 

Data sourced from Booz & Co, Seeking Alpha, Daily Finance; additional content by Warc staff, 21 January 2011


Thursday, January 20, 2011

Enterprise application software market in Middle East and North Africa to rebound from significant slowdown, says IDC

United Arab Emirates: Sunday, January 16 - 2011 

The Middle East and North Africa (MENA) enterprise application software (EAS) market is expected to return to double-digit growth rates from 2010 after suffering a considerable slowdown in growth in 2009 due to the impact of the global economic crisis on the region. According to a recent report by market research company IDC, the region's EAS market is predicted to expand at an average annual growth rate of 12.8% through 2014.

"Widespread regional liquidity difficulties and the delay or cancellation of EAS projects by organizations that were forced to revisit their spending plans proved particularly troublesome, although the market will soon rebound to former heights," says Dhiraj Daryani, senior analyst for the software market at IDC Middle East, Africa, and Turkey.

As the global economic recovery takes shape, and as governments and businesses in the MENA region proceed with IT modernization efforts and application transformation, high-growth vertical segments like education and healthcare will continue to remain the fastest growth areas for EAS solutions, while other segments that saw a marked reduction in spending in 2009, like business services and telecommunications, will forge ahead with EAS investments. 

Process and discrete manufacturing will continue to be critical drivers of EAS spending, but their share of the total market will gradually slow through 2014, while the finance vertical will rebound strongly from the pounding it took during the crisis. Saudi Arabia will remain the MENA region's largest EAS market, and will also be its fastest growing, while Egypt is also tipped for strong growth as large numbers of businesses migrate to modern EAS suites from the installed base of legacy applications. 

Global giants SAP, Oracle, and Microsoft Dynamics dominate the MENA region's EAS market. However, the leading vendors must not rest on their laurels as businesses and governments begin to re-examine their wait-and-see approach to investing in innovative IT applications. "Customers are no longer content with just implementing solutions from global vendors," says Daryani. "They also want to see the quantifiable value they derive from adopting such solutions.

"As such, vendors need to bring in world-class expertise and resources if they are to be seen as long-term players in local markets. Vendors that can both demonstrate their commitment to individual country markets and clearly convey the value of their solutions to customers stand to gain market share."

IDC's Arab Middle East and North Africa Enterprise Application Software 2010-2014 Forecast and 2009 Vendor Shares (IDC #ZR01S) provides a detailed overview of the region's market for integrated EAS suites. It includes detailed qualitative and quantitative information, analysis, and forecasts that help vendors answer key questions regarding market size, segmentation, market shares, and major economic and political factors affecting the Arab MENA EAS market.

Via: AMEinfo.com

Friday, January 14, 2011

Google Acquires eBook Technologies

By Thomas Claburn, InformationWeek
January 13, 2011 02:23 PM

The deal strengthens Google's digital content distribution capabilities and diminishes its vulnerability to potential patent lawsuits. 

Having acquired 25 companies in 2010, Google is continuing its buying spree in the new year with the purchase of eBook Technologies, a supplier of e-book reading devices and content distribution technology.

"eBook Technologies, Inc. is excited to announce that we have been acquired by Google," said the La Jolla, Calif.-based company on its Web site. "Working together with Google will further our commitment to providing a first-class reading experience on emerging tablets, e-readers and other portable devices."

No price for the deal was disclosed.


Google declined to provide specific details about its future product plans. "We are happy to welcome eBook Technologies' team to Google," said a spokesperson in an e-mailed statement. "Together, we hope to deliver richer reading experiences on tablets, electronic readers and other portable devices."

In December, Google launched its digital bookselling platform, Google eBooks, the culmination of years of legal wrangling and book scanning. Having entered into competition with Amazon and Apple in the process, Google has tried to differentiate itself by characterizing its ecosystem as more open than what's offered by its rivals.

"Open" however doesn't mean open in the sense of content without digital locks. It means open in the sense of allowing partners to have a meaningful role. In fact, Google's interest in eBook Technologies appears to be in securing e-book content and protecting itself against potential patent lawsuits.

eBook Technologies licenses e-book technology from companies that date back to the dot-com boom at the turn of the millennium: SoftBook Press, NuvoMedia, and Gemstar. It also appears to have rights related to some relevant patents, such as one titled "Electronic Display Device Event Tracking," which lists eBook Technologies co-founder and president Garth Conboy among the inventors.

Indeed, the company boasts about its intellectual property portfolio on its Web site, or at least it did until these pages were removed in conjunction with the acquisition. "Patented areas of the eBook technology suite cover the unique designs, features and functions of the entire eBook publishing system," the company states on its old Web site. "Intellectual property includes: the eBook system and features, cryptography, user interface elements, industrial design and manufacturing processes."

Google's interest in eBook Technologies may also have an enterprise angle: eBook Technologies has developed a comprehensive network architecture for e-book sales and distribution that includes a component called eBook Express Manager. This is an application that "allows enterprise customers to centrally manage the delivery, access, audit, and updating of enterprise content for groups of e-book users."

Read More Via: InformationWeek 

Enterprise Trends: Contrarians and Other Wise Forecasters


The gradual upturn from the worst economic conditions in decades is reason for hope. A growing economy, coupled with continued adoption of enterprise software in spite of the tough economic climate, keeps me tuned to what is transpiring in this industry. Rather than being cajoled into believing that "search" has become commodity software, which it hasn't, I want to comment on the wisdom of Jill Dyché and her Anti-predictions for 2011 in a recent Information Management Blog. There are important lessons here for enterprise search professionals, whether you have already implemented or plan to soon.

Taking her points out of order, I offer a bit of commentary on those that have a direct relationship to enterprise search. Based on past experience, Ms. Dyché predicts some negative outcomes but with a clear challenge for readers to prove her wrong. As noted, enterprise search offers some solutions to meet the challenges:

  1. No one will be willing to shine a bright light on the fact that the data on their enterprise data warehouse isn't integrated. It isn't just the data warehouse that lacks integration among assets, but all applications housing critical structured and unstructured content. This does not have to be the case. Several state-of-the-art enterprise search products that are not tied to a specific platform or suite of products do a fine job of federating indexing of disparate content repositories. In a matter of weeks or a few months, a search solution can be deployed to crawl, index and search multiple sources of content. Furthermore, newer search applications are being offered for pre-purchase testing for out-of-the-box suitability in pilot or proof-of-concept (POC) projects. Organizations that are serious about integrating content silos have no excuse for not taking advantage of easier-to-deploy search products.

  2. Even if they are presented with proof of value, management will be reluctant to invest in data governance. Combat this entrenched bias with a strategy to overcome lack of governance; a cost cutting argument is unlikely to change minds. However, risk is an argument that will resonate, particularly when bolstered with examples. Include instances when customers were lost due to poor performance or failure to deliver adequate support services, sales were lost because qualifying questions could not be answered or answers were not timely, legal or contract issues could not be defended due to inaccessibility of critical supporting documents, or maintenance revenue was lost due to incomplete, inaccurate or late renewal information getting out to clients. One simple example is the consequence of not sustaining a concordance of customer name, contact, and address changes. If content repositories cannot talk to each other or aggregate related information in a search, because nothing records that a Customer labeled Marion University at one address is the same as the Customer labeled University of Marion at another address, the result will be embarrassing in communications and, even worse, costly. Governance of processes like naming conventions and standardized labeling enhances the value and performance of every enterprise system including search.

  3. Executives won't approve new master data management or business intelligence funding without an ROI analysis. This ties in with the first item because many enterprise search applications include excellent tools for performing business intelligence, analytics, and advanced functions to track and evaluate content resource use. The latter is an excellent way to understand who is searching, for what types of data, and the language used to search. These supporting functions are being built into applications for enterprise search and do not add additional cost to product licenses or implementation. Look for enterprise search applications that are delivered with tools that can be employed on an ad hoc basis by any business manager.

  4. Developers won't track their time in any meaningful way. This is probably true because many managers are poorly equipped to evaluate what goes into software development. However, in this era of adoption of open source, particularly for enterprise search, organizations that commit to using Lucene or Solr (open source search) must be clear on the cost of building these tools into functioning systems for their specialized purposes. Whether development will be done internally or by a third party, it is essential to place strong boundaries around each project and deployment, with specifications that stage development, milestones and change orders. "Free" open source software is not free or even cost effective when an open meter for "time and materials" exists.

  5. Companies that don't characteristically invest in IT infrastructure won't change any time soon. So, the silo-ed projects will beget more silo-ed data... Because the adoption rate for new content management applications is so high, and the ease of deploying them encourages replication like rabbits, it is probably futile to try to staunch their proliferation. This is an important area for governance to be employed, to detect redundancy, perform analytics across silos, and call attention to obvious waste and duplication of content and effort. Newer search applications that can crawl and index a multitude of formats and repositories will easily support efforts to monitor and evaluate what is being discovered in search results. Given a little encouragement to report redundancy and replicated content, every user becomes a governor over waste. Play on the natural inclination for people to complain when they feel overwhelmed by messy search results, by setting up a simple (click a button) reporting mechanism to automatically issue a report or set a flag in a log file when a search reveals a problem.

  6. It is time to stop treating enterprise search like a failed experiment and instead, leverage it to address some long-standing technology elephants roaming around our enterprises.
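
The customer-name concordance problem raised in item 2 (Marion University versus University of Marion) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a feature of any particular search product: the stopword list, the helper names `name_key` and `group_by_key`, and the third record are invented for the example.

```python
import re

# Words ignored when comparing names (an assumption for this sketch).
STOPWORDS = {"of", "the", "at", "and"}


def name_key(name: str) -> frozenset:
    """Order-insensitive key: the set of significant lowercase tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


def group_by_key(names):
    """Group labels that share a key so a person can review them."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return groups


records = ["Marion University", "University of Marion", "Marion Univ."]
groups = group_by_key(records)
for members in groups.values():
    print(members)
```

The first two labels land in the same group, while the abbreviated "Marion Univ." does not, so abbreviations would need their own rule. Matches of this kind are only candidates for review: word-order-insensitive keys can also merge organizations that really are different, which is one reason the naming conventions themselves need governance.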


    To follow other search trends for the coming year, you may want to attend a forthcoming webinar, 11 Trends in Enterprise Search for 2011, which I will be moderating on January 25th. These two blogs also have interesting perspectives on what is in store for enterprise applications: CSI Info-Mgmt: Profiling Predictors 2011, by Jim Ericson and The Hottest BPM Trends You Must Embrace In 2011!, by Clay Richardson. Also, some of Ms. Dyché's commentary aligns nicely with "best practices" offered in this recent beacon, Establishing a Successful Enterprise Search Program: Five Best Practices

    Via: http://gilbane.com/search_blog/2011/01/enterprise_trends_contrarians_and_other_wise_forecasters.html#ixzz1B08RuXOW

Wednesday, January 12, 2011

Epic Media Group Teams with LucidMedia #semantic

Jan 11, 2011 (Close-Up Media via COMTEX) --

Epic Media Group, a privately held global digital marketing solutions company and the operator of Traffic Marketplace, announced that it has finalized a strategic technology partnership with LucidMedia, a digital advertising management platform, that will allow Epic Media Group to deploy interactive advertising campaigns leveraging the LucidMedia demand-side digital advertising management platform.

The announcement comes on the heels of months of collaboration between the two companies.

The Company said the new partnership will allow Epic to add LucidMedia's Real Time Bidding (RTB) technology and semantic contextual targeting to its suite of services and capabilities. Leveraging LucidMedia's platform, Epic will immediately bolster its ability to help agencies and advertisers more intelligently and cost effectively reach consumers across multiple pools of inventory via Real Time Bidding. Epic will also utilize RTB and semantic contextual targeting capabilities both within its own network of inventory, as well as across every major exchange and Supply-Side Optimizer such as AdMeld, PubMatic and Rubicon, in a fully integrated approach to campaign delivery.

"Epic's deep expertise in managing a full spectrum of advertising services, combined with LucidMedia's transparent and comprehensive demand-side RTB management platform as well as its advanced semantic-based contextual targeting engine, is a powerful combination," said Don Mathis, President and Co-CEO of Epic Media Group. "This technology partnership will further unlock massive scale, allowing Epic to enhance our suite of services and improve the performance and efficiency of our interactive solutions on behalf of our clients."

"After extensive testing on both sides, Epic has recognized that the depth and breadth of LucidMedia's platform capabilities can significantly advance their ability to get maximum value for their clients. Epic has confirmed that our platform will not only complement but also enhance its existing platform," said Ajay Sravanapudi, President and CEO of LucidMedia Networks, Inc. "We have spent a decade building and fine-tuning our demand-side platform for scalable ad operations and we have been working closely with Epic to integrate our capabilities in a way that will give their clients the flexibility and access they need to succeed online."



Via: http://www.tradingmarkets.com/news/stock-alert/epmi_epic-media-group-teams-with-lucidmedia-1414034.html

2011 Digital Trends #6 - Data and artificial intelligence "algorithm"

January 2011

This is a series of posts on my take on the 11 digital trends for 2011. This sixth trend is on data and AI. The trends are not in order of importance.

In the January 2011 edition of Wired Magazine, the feature The AI Revolution Is On says “By using probability-based algorithms to derive meaning from huge amounts of data, researchers discovered that they didn’t need to teach a computer how to accomplish a task; they could just show it what people did and let the machine figure out how to emulate that behavior under similar circumstances.”

On the web, artificial intelligence is already part of many websites we visit daily: Google’s search, Facebook’s friend recommendations and top news feed, Amazon’s recommendation engine, and Pandora’s and Last.fm’s recommendation engines. These AIs (some call them algorithms, which I think isn’t correct) derive their “intelligence” from the massive amounts of data generated by a huge user base, rather than from mimicking human intelligence.

AI is not analytics like Google Analytics or Facebook Insights. The goal of analytics is to offer you data for analysis and conclusions, while AI is to offer your users something actionable or tangible like a recommendation.

How is this relevant to marketing? Recommendation and personalization engines are the first things that come to mind. However, building one is clearly very costly and difficult, and you may not have a huge data set to begin with. I think the way to go is to rely on third-party services via APIs, as discussed in Trend #5.

The Facebook Recommendations plugin is one of the first to offer such service. It uses data from user interaction with the target website and the user’s friends’ interaction. I believe it takes into account the user Social Graph as well.

Google has a whole library of APIs for its various services, and a few have AI features. Use at your own risk though. Surprisingly, I find Google APIs quite unstable because there are so many and they seem to be created by different teams (they are, actually). If you are using Google AdSense, the AdSense API allows you to generate code that displays contextual ads. The YouTube Data API has some form of recommendation API as well that you can query for “related videos”, and the Google Language API deals with translation.

My 2011 Watch List for data and AI:

  • Facebook Social Graph – I hope that the next step is to provide a set of APIs that give third parties AI capabilities such as recommendations based on the data from Facebook itself and the third party sites.
  • Foursquare - I am interested to see how Foursquare’s location data can be used for recommendations and personalization beyond the “Trending Now” feature. Or does it need to be integrated with Facebook Social Graph?

Via: http://blog.campaignasia.com/yonghwee/2011-digital-trends-6-data-and-artificial-intelligence-algorithm/

Is there anything "artificial" about "artificial artificial intelligence"?

Artificial Artificial Intelligence is a term coined by Jeff Bezos for collectively mining our electronic data to provide seemingly intelligent responses to questions that are otherwise hard to answer. It's not exactly AI because it takes our collective human actions into account. It's been implemented by the Dragon Dictation software people in their iPhone version to improve their database of human speech-to-text translation (they record which word you choose when you make ambiguous sounds, making their guesses ever better on average).

I am not convinced that Artificial Artificial Intelligence is other than a new, unfamiliar form of intelligence.  Let us suppose that in the future, everyone in the world is on the internet with at least DSL speeds.  In this case, there is a sense in which everything we store on a computer represents all of our collective knowledge.

And so, with access to all of our comments, all of our search-correlation patterns, etc., and with sufficient processing speed, I do not understand what makes Google different from something needing to be called Mr. Google. It reads my mind literally as I write.

What could be more human than to try to answer all human queries?

Via: http://www.quora.com/Is-there-anything-artificial-about-artificial-artificial-intelligence

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

The context: semantic knowledge extraction from unstructured text

In a previous post we introduced fise, an open source semantic engine now being incubated at the ASF under the new name Apache Stanbol. Here is a 4-minute demo that explains how such a semantic engine can be used by a document management system such as Nuxeo DM to tag documents with entities instead of potentially ambiguous words:
The problem with the current implementation, which is based on OpenNLP, is the lack of readily available statistical models for Named Entity Recognition in languages such as French. Furthermore, the existing models are restricted to the detection of a few entity classes (right now the English models can detect people, place and organization names).
To build such a model, developers have to teach or train the system by applying a machine learning algorithm to an annotated corpus of data. It is very easy to write such a corpus for OpenNLP: just pass it a text file with one sentence per line, where entity occurrences are located using the START and END tags, for instance:
<START:person> Olivier Grisel <END> is working on the <START:software> Stanbol <END> project .
The only slight problem is to somehow convince someone to spend hours manually annotating hundreds of thousands of sentences from text on various topics such as business, famous people, sports, science, literature, history... without making too many mistakes.
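To make the annotation format more concrete, here is a minimal Python sketch that reads one annotated line back into entity type and text pairs. The regular expression and the function names are illustrative assumptions, not part of OpenNLP; the sketch only assumes the whitespace-separated START and END tags shown in the example sentence.

```python
import re

# Matches one annotated span such as "<START:person> Olivier Grisel <END>".
# Assumption: tags and tokens are separated by single spaces, as in the example.
SPAN = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")


def extract_entities(line):
    """Return (entity_type, entity_text) pairs found in one annotated sentence."""
    return [(m.group(1), m.group(2)) for m in SPAN.finditer(line)]


def strip_annotations(line):
    """Return the tokenized sentence with the START/END tags removed."""
    return SPAN.sub(lambda m: m.group(2), line)


example = ("<START:person> Olivier Grisel <END> is working on the "
           "<START:software> Stanbol <END> project .")
print(extract_entities(example))
print(strip_annotations(example))
```

A reader like this is handy for spot-checking a generated corpus, for instance by counting how many spans of each type it contains.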

Mining Wikipedia in the cloud

Instead of manually annotating text, one should try to benefit from an existing annotated and publicly available text corpus that deals with a wide range of topics, namely Wikipedia.
Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those articles refer to the entity classes we are interested in (e.g. persons, countries, cities, ...). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the Wikimarkup formatting syntax).
To find the type of the entity described by a given Wikipedia article, one can use the category information as described in a paper by Alexander E. Richman and Patrick Schone. Alternatively, we can use the semi-structured information available in the Wikipedia infoboxes. We decided to go for the latter by reusing the work done by the DBpedia project:
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopedia itself.
More specifically we will use a subset of the DBpedia RDF dumps:
  • instance_types_en.nt to relate a DBpedia entity ID to its entity class ID
  • wikipedia_links_en.nt to relate a Wikipedia article URL to its DBpedia entity ID
The mapping from a Wikipedia URL to a DBpedia entity ID is also available in 12 languages (en, de, fr, pl, ja, it, nl, es, pt, ru, sv, zh) which should allow us to reuse the same program to build statistical models for all of them.
Hence to summarize, we want a program that will:
  1. parse the Wikimarkup of a Wikipedia dump to extract unformatted text body along with the internal wikilink position information;
  2. for each link target, find the DBpedia ID of the entity if available (this is the equivalent of a JOIN operation in SQL);
  3. for each DBpedia entity ID find the entity class ID (this is another JOIN);
  4. convert the result to OpenNLP formatted files with entity class information.
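The two JOIN steps can be pictured with plain dictionaries. The following Python sketch uses tiny hypothetical lookup tables (the sentences, URLs and mappings are made up for illustration) and is not how pignlproc implements the joins; it only shows the chain from link target to DBpedia ID to entity class to OpenNLP type name.

```python
# Hypothetical toy data standing in for the parsed dumps.
links = [  # (sentence, link target URL), as produced by step 1
    ("Paris is the capital of France .", "http://en.wikipedia.org/wiki/Paris"),
    ("He studied in an old library .", "http://en.wikipedia.org/wiki/Library"),
]
wiki_to_dbpedia = {
    "http://en.wikipedia.org/wiki/Paris": "http://dbpedia.org/resource/Paris",
}
dbpedia_to_class = {
    "http://dbpedia.org/resource/Paris": "http://dbpedia.org/ontology/Place",
}
class_to_name = {"http://dbpedia.org/ontology/Place": "location"}


def resolve_type(link_target):
    """Follow the lookups of steps 2 and 3, then map the class to a type name."""
    dbpedia_id = wiki_to_dbpedia.get(link_target)   # step 2 (first JOIN)
    class_id = dbpedia_to_class.get(dbpedia_id)     # step 3 (second JOIN)
    return class_to_name.get(class_id)              # type name used in step 4


typed = []
for sentence, target in links:
    name = resolve_type(target)
    if name is not None:  # inner-join semantics: drop links without a known class
        typed.append((sentence, target, name))

print(typed)
```

In-memory dictionaries only work for toy data; at Wikipedia scale the same lookups have to be expressed as distributed joins.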
In order to implement this, we started a new project called pignlproc, licensed under ASL2. The source code is available as a github repository. pignlproc uses Apache Hadoop for distributed processing, Apache Pig for high level Hadoop scripting and Apache Whirr to deploy and manage Hadoop on a cluster of tens of virtual machines on the Amazon EC2 cloud infrastructure (you can also run it locally on a single machine of course).
Detailed instructions on how to run this yourself are available in the README.md and the online wiki.

Parsing the wikimarkup

The first script performs the first step of the program, namely parsing and cleaning up the wikimarkup and extracting the sentences and link positions. This script uses some pignlproc specific User Defined Functions written in Java to parse the XML dump, parse the Wikimarkup syntax using the bliki wiki parser and detect sentence boundaries using OpenNLP, all of this while propagating the link positioning information.
-- Register the project jar to use the custom loaders and UDFs
REGISTER $PIGNLPROC_JAR
parsed = LOAD '$INPUT'
  USING pignlproc.storage.ParsingWikipediaLoader('$LANG')
  AS (title, wikiuri, text, redirect, links, headers, paragraphs);
-- filter and project as early as possible
noredirect = FILTER parsed by redirect IS NULL;
projected = FOREACH noredirect GENERATE title, text, links, paragraphs;
-- Extract the sentence contexts of the links respecting the paragraph
-- boundaries
sentences = FOREACH projected
  GENERATE title, flatten(pignlproc.evaluation.SentencesWithLink(
    text, links, paragraphs));
stored = FOREACH sentences
  GENERATE title, sentenceOrder, linkTarget, linkBegin, linkEnd, sentence;
-- Ensure ordering for fast merge with type info later
ordered = ORDER stored BY linkTarget ASC, title ASC, sentenceOrder ASC;
STORE ordered INTO '$OUTPUT/$LANG/sentences_with_links';
We store the intermediate results on HDFS for later reuse by the last script in step 4.
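To illustrate what "propagating the link positioning information" means, here is a small Python sketch that strips simple wikilinks from a piece of markup while recording where each link label ends up in the plain text. It is a deliberately naive stand-in: the real project relies on the bliki parser, and this regular expression only handles the [[target]] and [[target|label]] forms.

```python
import re

# Simple wikilinks only: [[target]] or [[target|label]].
# Assumption: no nesting, templates or other Wikimarkup constructs.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def strip_links(markup):
    """Return (plain_text, [(target, begin, end), ...]) with offsets in plain_text."""
    parts, links = [], []
    cursor = 0   # position in the markup
    length = 0   # length of the plain text emitted so far
    for m in WIKILINK.finditer(markup):
        before = markup[cursor:m.start()]
        label = m.group(2) or m.group(1)
        parts.append(before)
        length += len(before)
        links.append((m.group(1), length, length + len(label)))
        parts.append(label)
        length += len(label)
        cursor = m.end()
    parts.append(markup[cursor:])
    return "".join(parts), links


text, found = strip_links("[[Paris]] is the capital of [[France|the French Republic]].")
print(text)
print(found)
```

The (begin, end) offsets play the role of the linkBegin and linkEnd fields stored alongside each sentence.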

Extracting entity class information from DBpedia

The second script performs steps 2 and 3 (the joins) on the DBpedia dumps. This script also uses some pignlproc specific tools to quickly parse NT triples while filtering out those that are not interesting:
-- Load wikipedia, instance types and redirects from DBpedia dumps
wikipedia_links = LOAD '$INPUT/wikipedia_links_$LANG.nt'
  USING pignlproc.storage.UriUriNTriplesLoader(
    'http://xmlns.com/foaf/0.1/primaryTopic')
  AS (wikiuri: chararray, dburi: chararray);
wikipedia_links2 =
  FILTER wikipedia_links BY wikiuri IS NOT NULL;
-- Load DBpedia type data and filter out the overly generic owl:Thing type
instance_types =
  LOAD '$INPUT/instance_types_en.nt'
  USING pignlproc.storage.UriUriNTriplesLoader(
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
  AS (dburi: chararray, type: chararray);
instance_types_no_thing =
  FILTER instance_types BY type NEQ 'http://www.w3.org/2002/07/owl#Thing';
joined = JOIN instance_types_no_thing BY dburi, wikipedia_links2 BY dburi;
projected = FOREACH joined GENERATE wikiuri, type;
-- Ensure ordering for fast merge with sentence links
ordered = ORDER projected BY wikiuri ASC, type ASC;
STORE ordered INTO '$OUTPUT/$LANG/wikiuri_to_types';
Again we store the intermediate results on HDFS for later reuse by other scripts.
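For readers unfamiliar with the N-Triples format, the loading and filtering logic can be sketched in a few lines of Python. This is only an illustration of the idea (keep URI-to-URI triples for one predicate, then drop owl:Thing); the function name and sample triples are assumptions, and the actual UriUriNTriplesLoader is a Java Pig loader whose internals may differ.

```python
import re

# One N-Triples statement whose subject, predicate and object are all URIs.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"


def load_uri_pairs(lines, predicate):
    """Yield (subject, object) for URI-URI triples that use the given predicate."""
    for line in lines:
        m = TRIPLE.match(line)
        if m and m.group(2) == predicate:
            yield m.group(1), m.group(3)


# Hypothetical sample lines in the style of a DBpedia dump.
sample = [
    "<http://dbpedia.org/resource/Paris> <" + RDF_TYPE + "> <" + OWL_THING + "> .",
    "<http://dbpedia.org/resource/Paris> <" + RDF_TYPE + "> <http://dbpedia.org/ontology/Place> .",
    '<http://dbpedia.org/resource/Paris> <http://www.w3.org/2000/01/rdf-schema#label> "Paris"@en .',
]

# Keep type triples only and drop the overly generic owl:Thing type.
types = [(s, o) for s, o in load_uri_pairs(sample, RDF_TYPE) if o != OWL_THING]
print(types)
```

Triples with a literal object, such as the label line in the sample, do not match the pattern and are skipped.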

Merging and converting to the OpenNLP annotation format

Finally the last script takes as input the previously generated files and an additional mapping from DBpedia class names to their OpenNLP counterpart, for instance:
http://dbpedia.org/ontology/Person          person
http://dbpedia.org/ontology/Place           location
http://dbpedia.org/ontology/Organisation    organization
http://dbpedia.org/ontology/Album           album
http://dbpedia.org/ontology/Film            movie
http://dbpedia.org/ontology/Book            book
http://dbpedia.org/ontology/Software        software
http://dbpedia.org/ontology/Drug            drug
The Pig script that does the final joins and the conversion to the OpenNLP output format is shown below. Here again pignlproc provides some UDFs for converting the Pig tuple and bag representation to the serialized format accepted by OpenNLP:
SET default_parallel 40
REGISTER $PIGNLPROC_JAR
-- use the english tokenizer for other European languages as well
DEFINE opennlp_merge pignlproc.evaluation.MergeAsOpenNLPAnnotatedText('en');
sentences = LOAD '$INPUT/$LANG/sentences_with_links'
  AS (title: chararray, sentenceOrder: int, linkTarget: chararray,
      linkBegin: int, linkEnd: int, sentence: chararray);
wikiuri_types = LOAD '$INPUT/$LANG/wikiuri_to_types'
  AS (wikiuri: chararray, typeuri: chararray);
-- load the type mapping from DBpedia type URI to OpenNLP type name
type_names = LOAD '$TYPE_NAMES' AS (typeuri: chararray, typename: chararray);
-- Perform successive joins to find the OpenNLP typename of the linkTarget
joined = JOIN wikiuri_types BY typeuri, type_names BY typeuri USING 'replicated';
joined_projected = FOREACH joined GENERATE wikiuri, typename;
joined2 = JOIN joined_projected BY wikiuri, sentences BY linkTarget;
result = FOREACH joined2 GENERATE title, sentenceOrder, typename, linkBegin, linkEnd, sentence;
-- Reorder and group by article title and sentence order
ordered = ORDER result BY title ASC, sentenceOrder ASC;
grouped = GROUP ordered BY (title, sentenceOrder);
-- Convert to the OpenNLP training format
opennlp_corpus =
  FOREACH grouped
  GENERATE opennlp_merge(
    ordered.sentence, ordered.linkBegin, ordered.linkEnd, ordered.typename);
STORE opennlp_corpus INTO '$OUTPUT/$LANG/opennlp';
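Conceptually, the final conversion turns a sentence plus a list of (linkBegin, linkEnd, typename) offsets into one annotated line. The Python sketch below shows that idea under simplifying assumptions: the sentence is already tokenized with spaces, overlapping links are skipped, and the function name is hypothetical. It is not the code of the MergeAsOpenNLPAnnotatedText UDF.

```python
def annotate(sentence, spans):
    """Wrap each (begin, end, typename) character span in START/END tags."""
    out, cursor = [], 0
    for begin, end, typename in sorted(spans):
        if begin < cursor:
            continue  # skip overlapping links; nesting is not handled in this sketch
        out.append(sentence[cursor:begin])
        out.append(" <START:%s> %s <END> " % (typename, sentence[begin:end]))
        cursor = end
    out.append(sentence[cursor:])
    # Normalize whitespace so that tags and tokens are separated by single spaces.
    return " ".join("".join(out).split())


sentence = "Olivier Grisel is working on the Stanbol project ."
start = sentence.index("Stanbol")
spans = [(start, start + len("Stanbol"), "software"), (0, len("Olivier Grisel"), "person")]
print(annotate(sentence, spans))
```

Sorting the spans first means the links of a sentence can arrive in any order, which matters once they come out of a grouped bag.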
Depending on the size of the corpus and the number of nodes you are using, each individual job will take from a couple of minutes to a couple of hours. For instance, the first steps for parsing 3GB of Wikipedia XML chunks on 30 small EC2 instances will typically take between 5 and 10 minutes.

Some preliminary results

Here is a sample of the output on the French Wikipedia dump for location detection only:
You can replace "location" by "person" or "organization" in the previous URL for more examples. You can also replace "part-r-00000" by "part-r-000XX" to download larger chunks of the corpus. You can also replace "fr" by "en" to get English sentences.
By concatenating chunks of each corpus into files of ~100k lines, one can get reasonably sized input files for the OpenNLP command line tool:
$ opennlp TokenNameFinderTrainer -lang fr -encoding utf-8 \
   -iterations 50  -type location -model fr-ner-location.bin \
   -data ~/data/fr/opennlp_location/train
Here are the resulting models:
It is possible to retrain those models on a larger subset of chunks by allocating more than 2GB of heap-space to the OpenNLP CLI tool (I used version 1.5.0). To evaluate the performance of the trained models you can run the OpenNLP evaluator on a separate part of the corpus (commonly called the testing set):
$ opennlp TokenNameFinderEvaluator -encoding utf-8 \
   -model fr-ner-location.bin \
   -data ~/data/fr/opennlp_location/test
The corpus is quite noisy, so the performance of the trained models is not optimal (but better than nothing anyway). Here are the results of evaluations on held-out chunks of the French and English corpora (+/- 0.02):
Performance evaluation for NER on a French extraction with 100k sentences

  Class          Precision   Recall   F1-score
  location       0.87        0.74     0.80
  person         0.80        0.68     0.74
  organization   0.80        0.65     0.72

Performance evaluation for NER on an English extraction with 100k sentences

  Class          Precision   Recall   F1-score
  location       0.77        0.67     0.71
  person         0.80        0.70     0.75
  organization   0.79        0.64     0.70
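Assuming the three numbers on each row are precision, recall and F1-score in that order (an assumption that the arithmetic supports), the third value can be recomputed from the first two as the harmonic mean. A short Python check:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, published third column) copied from the two tables above.
rows = {
    "fr location": (0.87, 0.74, 0.80),
    "fr person": (0.80, 0.68, 0.74),
    "fr organization": (0.80, 0.65, 0.72),
    "en location": (0.77, 0.67, 0.71),
    "en person": (0.80, 0.70, 0.75),
    "en organization": (0.79, 0.64, 0.70),
}

for name, (p, r, published) in rows.items():
    recomputed = f1_score(p, r)
    print("%-16s recomputed=%.3f published=%.2f" % (name, recomputed, published))
```

The recomputed values land within 0.01 of the published ones, which is what one would expect when the inputs themselves have already been rounded to two decimals.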

The results of this first experiment are interesting, but lower than the state of the art, especially for the recall values. The main reason is that many sentences in Wikipedia contain entities that do not carry a link.
A potential way to improve this would be to set up a sort of active learning tooling in which the trained models are reused to suggest missing annotations to a human validator, who can quickly accept or reject them. This would improve the quality of the corpus, and then the quality of the following generation of models, until the corpus reaches the quality of a fully manually annotated one.

Future work

I hope that this first experiment will convince some of you of the power of combining tools such as Pig, Hadoop and OpenNLP for batch text analytics. By advertising the project on the OpenNLP users mailing list, we already got some very positive feedback. It is very likely that pignlproc will get contributed one way or another to the OpenNLP project.
These tools are, of course, not limited to training OpenNLP models, and it will be very easy to adapt the code of the conversion UDF to generate BIO formatted corpora to be used by other NLP libraries such as NLTK for instance.
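As a rough idea of what a BIO conversion involves, the following Python sketch turns one line in the OpenNLP annotation format into (token, tag) pairs, where B- marks the first token of an entity, I- the following ones and O everything else. The function name is hypothetical and the sketch works on the text format rather than inside a Pig UDF.

```python
def opennlp_to_bio(line):
    """Convert one OpenNLP annotated sentence into (token, BIO tag) pairs."""
    pairs = []
    current = None   # entity type of the span being read, if any
    inside = False   # True once the first token of the span has been emitted
    for token in line.split():
        if token.startswith("<START:") and token.endswith(">"):
            current, inside = token[len("<START:"):-1], False
        elif token == "<END>":
            current = None
        elif current is None:
            pairs.append((token, "O"))
        else:
            pairs.append((token, ("I-" if inside else "B-") + current))
            inside = True
    return pairs


line = ("<START:person> Olivier Grisel <END> is working on the "
        "<START:software> Stanbol <END> project .")
for token, tag in opennlp_to_bio(line):
    print(token, tag)
```

Printing one token and tag per line, with a blank line between sentences, gives the column layout that many sequence-labelling tools accept.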
Finally, there is no reason to limit this processing to NER corpora generation. Similar UDFs and scripts could be produced to identify text sentences that express in a natural language the relationships that link entities and that have already been extracted in a structured manner from the infoboxes by the DBpedia project.
Such a new corpus would be of great value for developing and evaluating the quality of automated extraction of entity relationships and properties. Such a new extractor could potentially be based on syntactic parsers such as the one available in OpenNLP or MaltParser.


This work was funded by the Scribo and IKS R&D projects. We also would like to thank all the developers of the involved projects.

N-gram analysis of 970 microbial organisms reveals presence of biological language models

By: BMC Bioinformatics

Published: 10 January 2011


It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts such as "signature-style" word usage indicative of authors or topics, and that the algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied for comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are found in abundance in one organism but occurring very rarely in other organisms, thereby serving as genome signatures. At that time proteomes of only 44 organisms were available, thereby limiting the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree.


We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking the statistical n-gram model of one organism as reference and computing the cross-perplexity of all other microbial proteomes with it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model from the proteome of Shigellae flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms was in the range of 15.59 to 29.5 and was proportional to their branching distance in the evolutionary tree from S. flexneri. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values with E. coli.
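For readers who have not met perplexity before, here is a minimal Python sketch of self-perplexity and cross-perplexity for an amino acid n-gram model. It is an illustration only: the add-one smoothing, the toy sequences and the function names are assumptions, and it does not reproduce the Biological Language Modeling Toolkit used in the study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid letters


def train(sequence, n):
    """Count n-grams and their (n-1)-length contexts in a reference sequence."""
    positions = range(len(sequence) - n + 1)
    ngrams = Counter(sequence[i:i + n] for i in positions)
    contexts = Counter(sequence[i:i + n - 1] for i in positions)
    return ngrams, contexts


def perplexity(model, sequence, n):
    """exp of the average negative log-probability of each n-gram in sequence."""
    ngrams, contexts = model
    log_prob, count = 0.0, 0
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        # Add-one smoothing (an assumption) keeps unseen n-grams above zero.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + len(AMINO_ACIDS))
        log_prob += math.log(p)
        count += 1
    return math.exp(-log_prob / count)


# Toy sequences, not real proteomes.
reference = "MKTAYIAKQR" * 20
other = "GSHWDNPCEF" * 20

model = train(reference, 2)
self_ppl = perplexity(model, reference, 2)
cross_ppl = perplexity(model, other, 2)
print(self_ppl, cross_ppl)
```

The model is less "surprised" by the sequence it was trained on than by an unrelated one, so the self-perplexity is the lower of the two numbers; the study uses the size of that gap as a measure of distance between organisms.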


Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences in abundance in one organism but occurring very rarely in other organisms, thereby serving as proteome signatures. Further it has also been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.

The complete article is available as a provisional PDF. The fully formatted PDF and HTML versions are in production.