The GSoC rocket launched

24 April 2012

Announcement Community ~

The Google Summer of Code 2012 has officially started! The results have been announced by Google. Congratulations to the students who join the Gephi project:

  • Eduardo Espinoza – Legend Module in Preview
  • Romain Yon – Cloud Gephi
  • Taras Klaskovsky – Force Directed Edge Bundling algorithm
  • Vikash Anand – Statistics Reports and HTML5 Charts
  • Min WU – Interconnect Graph Streaming API and GraphStream

You put a lot of care into preparing the best applications and demonstrated great motivation in addition to strong technical skills. We are very excited to work with you!

This year we are also honored to count on world-class researchers as mentors. Yoann Pigné is an Assistant Professor at the University of Le Havre, France, and a leader of the GraphStream project. He will co-mentor the Graph Streaming project with André Panisson, our former Google Summer of Code student. André recently received his Ph.D. and authored the video of the Egyptian Revolution on Twitter. Finally, Christian Tominski, who mentored the Preview refactoring last year, will mentor the Force Directed Edge Bundling project. He is a Lecturer and Researcher at the Institute for Computer Science at the University of Rostock and has authored several articles in the field of information visualization.

Former Google Summer of Code students, like Luiz Ribeiro, will also mentor and advise students.

The Summer Timeline:

* Until May 21: Students get to know their mentors, read documentation, and get up to speed to begin working on their projects.
* May 21: Students start to code
* July 13: Mid-term evaluation
* August 24: Final evaluation


Announcement Community ~

Great news: Gephi has been accepted again for the Google Summer of Code. The program is the best way for students around the world to start contributing to an open-source project. Since 2009, each edition has been a great success and has dramatically boosted the development of the Gephi project.

What is Gephi?

Networks are everywhere: email systems, financial transaction systems and gene-protein interaction networks are just a few examples. Gephi began as a university student project four years ago and has quickly become an open source software leader in the visualization and analysis of large networks. It is an important contribution to the ecosystem of tools used by researchers and big data analysts to explore and extract value from the deluge of relational data and disseminate a better understanding for people to think about a “connected” world.

Gephi is a “Photoshop” for graphs: designed to make data navigation and manipulation easy, it covers the entire process from data importing to aesthetics refinements and communication. Users interact with the visualization and manipulate structures, shapes and colors to reveal the properties of complex and messy data. The goal is to help data analysts make hypotheses and intuitively discover patterns or errors in large data collections.

The Gephi project aims to provide the perfect tool to visualize and analyze networks. We focus on usability, performance and modularity:

  • Usability: Easy to install, a UI that requires no scripting, and real-time manipulation.
  • Performance: The visualization engine and data structures are built to scale. Supporting ever-larger graphs is an endless challenge!
  • Modularity: Extensible software architecture, built on top of the NetBeans Platform. Add plug-ins with ease.

Learn more about Gephi, watch Introducing Gephi 0.7, then download it and try it by following the Quick Start Tutorial.

The Gephi project is young, and its growing community is composed of engineers and scientists involved in network science, datavis and complex networks.

List of ideas

The list of ideas is available on our wiki. The ideas cover various skills and levels of difficulty:

* Legend module: Integrate a legend in the Preview module
* Flexible Table Importer: Create a generic network creation wizard from data tables
* Cloud Gephi: Build an online gallery and bring some of Gephi’s features to the cloud
* Force Directed Edge Bundling: Implement the Force Directed Edge Bundling algorithm
* Statistics Reports and HTML5 Charts: Improve the statistics reports and port existing charts to HTML5 + JavaScript
* Statistics Unit Tests: Add unit tests to the statistical algorithms
* Graph Streaming: Improve the Graph Streaming API and interconnect GraphStream’s dynamic graph event model with Gephi

Please also propose your ideas on the forum. They will be considered and discussed by the community. Have a look at our long-term Roadmap.

Students, join the network

Students, apply now for Gephi proposals. Join us on the forum and fill in the questionnaire. Be careful: the deadline for submitting proposals is April 6 (timeline)!

Hélder Suzuki, student for Gephi in 2009 and now software engineer at Google, wrote:
At Gephi students will have the opportunity to produce high impact work on a rapidly growing area and be noted for it.

View our previous Google Summer of Code projects here and read former students’ interviews.



Community ~

This post was originally posted on the Google Open Source blog by Sébastien Heymann, co-founder of the Gephi project and Google Summer of Code administrator.

Networks are everywhere: email systems, financial transaction systems and gene-protein interaction networks are just a few examples. Gephi began as a university student project four years ago and has quickly become an open source software leader in the visualization and analysis of large networks. It is an important contribution to the ecosystem of tools used by researchers and big data analysts to explore and extract value from the deluge of relational data and disseminate a better understanding for people to think about a “connected” world.

Gephi is a “Photoshop” for such data: designed to make data navigation and manipulation easy, it covers the entire process from data importing to aesthetics refinements and communication. Users interact with the visualization and manipulate structures, shapes and colors to reveal the properties of complex and messy data. The goal is to help data analysts make hypotheses and intuitively discover patterns or errors in large data collections.

Our success was made much faster thanks to the Google Summer of Code. The timing of our acceptance into our first Google Summer of Code in 2009 was perfect: we were at the point where we could make the project really open in the way our infrastructure could scale code, and our human organization was ready to welcome contributors. Participating in the program gave us a boost of fame helping us promote the project and created an international community for Gephi.

We met many people and learned a lot, but this is the most important lesson to share: though students are paid stipends for their work during the program, money should not be the first incentive. To encourage students to stick with the project, we talk with each of them to find their deeper motivations in working on Gephi and try and develop a win-win situation. And it works! Many of the students continue to contribute to the project for at least a few months after the end of the Google Summer of Code program, and others have gone on to become members of our team.

We recognize this long-term investment by promoting their work, like André Panisson who released a plug-in in 2010, which connects Gephi to a graph stream and visualizes it in real-time. André made this amazing video of the Egyptian Revolution on Twitter, when he monitored the hashtag #jan25. More recently, Martin Škurla presented his work at FOSDEM 2012 and talked about his plug-in which connects Gephi to the graph database Neo4j. He started his project during the Google Summer of Code 2010 and continued his work until the release. We really appreciated the effort, so the Gephi Consortium and Neo Technologies Inc. paid his expenses to attend the conference. Finally, I must talk about Eduardo Ramos, who we rejected as a student two years ago for Google Summer of Code but who was so motivated that he decided to contribute to Gephi anyway, becoming one of the project leaders, a Google Summer of Code mentor… and a friend!

To learn more about Gephi, watch our madness screencast and view our previous Google Summer of Code projects here. Want to apply for Gephi? Join us on the forum.


Functionality ~

My name is Ernesto Aneiros and during this Google Summer of Code I am working on the Attributes Disk Store.

The problem

In Gephi, Attributes are the data associated with nodes and edges. As graphs grow larger and larger, attributes occupy more memory, even though they are often not essential to end-users who are only applying transformations or algorithms to the graphs. These attributes can be of different types, from simple Java types (byte, char, int, String, etc.) to Gephi’s internal data types (lists of primitives or versioned data). The idea for the project was to have a combined memory/disk cache system to partially off-load these attributes to disk. The system should have a well-designed cache to handle heavy read access on the most-accessed elements.
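To make the combined memory/disk idea concrete, here is a minimal sketch using only the JDK. It is not the project's code: the class `TwoTierAttributeStore`, its methods and the one-file-per-key layout are all hypothetical, and attribute values are simplified to strings. The most recently used values stay in memory, and the least recently used one is written to disk when the memory tier is full.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-tier store: hot attribute values in memory, cold ones in files. */
class TwoTierAttributeStore {
    private final Path dir;
    private final Map<String, String> hot;

    TwoTierAttributeStore(Path dir, final int maxInMemory) {
        this.dir = dir;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.hot = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() > maxInMemory) {
                    spill(eldest.getKey(), eldest.getValue()); // persist before eviction
                    return true;
                }
                return false;
            }
        };
    }

    /** Convenience factory that stores cold values in a fresh temporary directory. */
    static TwoTierAttributeStore inTempDir(int maxInMemory) {
        try {
            return new TwoTierAttributeStore(Files.createTempDirectory("attrs"), maxInMemory);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void put(String key, String value) {
        hot.put(key, value);
    }

    /** Returns the value from memory, falling back to disk; null if unknown. */
    String get(String key) {
        String value = hot.get(key);
        if (value == null) {
            value = readFromDisk(key);
            if (value != null) {
                hot.put(key, value); // promote to memory; may spill another entry
            }
        }
        return value;
    }

    int inMemoryCount() {
        return hot.size();
    }

    private void spill(String key, String value) {
        try {
            Files.write(fileFor(key), value.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private String readFromDisk(String key) {
        Path file = fileFor(key);
        if (!Files.exists(file)) {
            return null;
        }
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hex-encodes the key so that any key maps to a safe, unique file name. */
    private Path fileFor(String key) {
        StringBuilder name = new StringBuilder();
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            name.append(String.format("%02x", b));
        }
        return dir.resolve(name + ".attr");
    }
}
```

A real store would also need transactions, typed values and cleanup of stale files, which this sketch leaves out on purpose.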

The Solution (1st iteration)

Lucene is one of the most popular text search engines and a flagship Java open source project. It is capable of handling and indexing millions of records while remaining performant. When the idea for the Attributes API cache was born, Lucene was the first candidate considered for the role of data store; as an added bonus, Gephi would get full-text search capabilities with almost no extra effort. When analyzing the problem, the following criteria were developed to judge a possible data store:

  1. Reliable (resistant to corrupted disk data, failed transactions, unexpected errors)
  2. Fast
  3. Transparent (minimum complexity exposed to the end-user)

While Lucene complies with items 2 and 3, its approach to corrupted indices is to rebuild them from scratch, and it therefore fails item 1. This doesn’t pose a problem for Lucene because, in the context where it is supposed to be used (indexing of external information), the input data is always available separately from the index and can be accessed if needed. In Gephi, however, this is not the case. Once Attributes are loaded from disk they remain in memory until saved back to file. If an error occurs during a disk store transaction, the end-user can end up losing a day’s work, which is certainly not acceptable.

The Solution (2nd iteration)

After Lucene was ruled out as a contender for the data store, several options were considered, including embedded SQL databases and a combination of Ehcache plus BerkeleyDB. Both options bring a lot to the table, and embedded databases in Java have achieved impressive performance when compared to mainstream database systems (see the H2 and HSQL projects, for example). Ehcache + BerkeleyDB, however, win when complexity is considered, since they introduce almost no translation layers between Gephi and the cache. Both solutions are good fits for the problem, but in the end the balance tilted in favor of Ehcache + BDB because of the complexity consideration.

Optimizing Ehcache and BerkeleyDB

Even though Ehcache provides a great deal of functionality and features, it was relatively easy to get up to speed with it. The online documentation was very complete, with code samples and detailed explanations. In almost no time an in-memory cache was up, running and being tested. Traditionally, cache sizes have been specified as the maximum number of elements they can hold. The 2.5 BETA of Ehcache introduced a new feature that allows sizing caches by memory consumed instead of elements held. For our project this is a killer feature, since we can now expose a single option that lets users specify how much memory the cache should consume. Even though using the new feature proved a little more complicated than expected, we obtained great feedback from the Ehcache community, especially from alexsnaps and Mike Allen, which helped us solve the issues we were having.
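The difference between the two sizing strategies can be illustrated with a small JDK-only sketch. This is not Ehcache's API or implementation: `ByteBudgetCache` and its size-estimating function are hypothetical names, and the size of each value is an estimate supplied by the caller. The cache evicts least recently used entries until the estimated total fits within a byte budget, instead of counting elements.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical LRU cache bounded by estimated bytes rather than by entry count. */
class ByteBudgetCache<K, V> {
    private final long maxBytes;
    private final ToLongFunction<V> sizeOf; // caller-supplied size estimate per value
    // accessOrder=true keeps entries ordered from least to most recently used.
    private final LinkedHashMap<K, V> entries = new LinkedHashMap<>(16, 0.75f, true);
    private long usedBytes;

    ByteBudgetCache(long maxBytes, ToLongFunction<V> sizeOf) {
        this.maxBytes = maxBytes;
        this.sizeOf = sizeOf;
    }

    /** Stores a non-null value, then evicts old entries until the budget is respected. */
    void put(K key, V value) {
        V old = entries.put(key, value);
        if (old != null) {
            usedBytes -= sizeOf.applyAsLong(old);
        }
        usedBytes += sizeOf.applyAsLong(value);
        evictUntilWithinBudget();
    }

    V get(K key) {
        return entries.get(key);
    }

    long usedBytes() {
        return usedBytes;
    }

    private void evictUntilWithinBudget() {
        Iterator<Map.Entry<K, V>> it = entries.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            usedBytes -= sizeOf.applyAsLong(eldest.getValue());
            it.remove();
        }
    }
}
```

Note that a single value larger than the whole budget is evicted immediately in this sketch. The appeal for the end-user is the same as described above: one number, the memory budget, is the only setting to expose.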

BerkeleyDB, on the other hand, is a very complex piece of software. With years of development under its belt, BDB has evolved into a very robust and flexible database. In fact, it is so flexible that it can be used as a full-blown database supporting queries, as a simple key/value datastore, or with a front-end that exposes a Java collections map, which greatly simplifies its use. All of this flexibility does not come free, though: configuring and optimizing BerkeleyDB requires delving into details about transactions, buffers, log file sizes and BDB internals. However, the tools are there and the information provided is quite good, especially the FAQ and the optimization section.

Integration with Gephi

Since ease of use and transparency are important considerations for the end-user of Gephi, only the minimal configuration options are exposed in the preference panel of the disk cache, but an Advanced tab provides more control for those who want it.


The General settings tab, where cache can be enabled or disabled and the memory usage configured.


The advanced settings tab allows a more advanced user to configure several of BerkeleyDB’s options.

The Disk Store in Action


Memory consumption without the disk store. It reaches 400MB.


Memory consumption with the store: after loading, it drops below 400 MB. Note how the load time increased due to disk operations, a trade-off to consider when using the store.

Known issues

The project is still in development. Since saving memory is the main goal of the disk store project, the results are not yet good enough, for several reasons.

While BerkeleyDB provides a very convenient way of storing bytes on disk, it is still database-oriented software and is therefore not the most suitable solution for our project: it uses a lot of memory for caching data and for building and maintaining its index (features desirable for databases but not for this project).

Trying to reduce BerkeleyDB’s memory usage through its settings produces quite different results on different systems, or even on the same system. The benchmark above shows reasonably good results, but this is not always the case. Maximum heap growth is better controlled, but there are still memory usage peaks that prevent greater savings.

The conclusion is that it is a priority to replace BerkeleyDB with another disk persistence system, or to create one specifically designed for the Gephi disk store.

It is also known that graphs with more complex data, like strings or lists, will always benefit more from a disk storage system than graphs with simple data like integers or booleans. One idea is to always keep simple data in memory, because indexing it on disk would need as much memory anyway, or even more.
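That idea could be expressed as a simple placement rule. The sketch below is only an illustration of such a policy, not something the disk store implements: the class `AttributePlacement` and the exact set of types treated as "simple" are assumptions.

```java
import java.util.Collection;

/** Hypothetical placement policy: small fixed-size values stay in memory. */
class AttributePlacement {
    /** Returns true if the value is a candidate for off-loading to disk. */
    static boolean shouldOffload(Object value) {
        if (value == null) {
            return false; // nothing to store
        }
        // Boxed numbers, booleans and characters cost about as much as a disk index entry.
        if (value instanceof Number || value instanceof Boolean || value instanceof Character) {
            return false;
        }
        // Strings, lists and arrays can grow large, so they are worth moving to disk.
        return value instanceof CharSequence
                || value instanceof Collection
                || value.getClass().isArray();
    }
}
```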

On the other hand, Gephi was designed with in-memory data structures in mind. Adding a cache/disk store to the system is bound to create integration issues with other parts of the codebase. For example, the GEXF file importer tends to load large portions of the graph file into memory while parsing it, which is not ideal in memory-constrained environments, and using the cache here will not make a difference. Another of these issues concerns the handling of data in files with the .gephi format. Due to the way .gephi files are imported, some integration problems still need to be debugged before the disk store works properly with them.

Looking to the future

This GSOC project is only scratching the surface of what a memory + disk cache system can achieve. In the future BerkeleyDB could be replaced with another persistence provider, and it doesn’t necessarily have to persist locally to disk. For example, BerkeleyDB could be replaced with a datastore like Cassandra, or maybe some RDBMS.

Conclusion

While the Data Store API introduced by this project is still taking its first steps and can evolve significantly, it has helped iron out many issues and has paved the way for bigger and better improvements. Working during this summer has been a great experience, and I have been able to work with great mentors like Eduardo Ramos, who knows the Gephi codebase inside and out. I hope the work of all of Gephi’s GSOC’ers becomes the starting point for many new features and enhancements that the community will surely appreciate. Happy coding and see you next summer!


Design Functionality ~

My name is Daniel Bernardes and during this Google Summer of Code I am working on the new Timeline interface.

Dynamic graphs have been the subject of increasing interest, given their potential as a theoretical model and their promising applications. Following this trend, Gephi has incorporated tools to study dynamic networks. From a visualization perspective, a critical tool is the Timeline component, which allows users to select pertinent time intervals and display and explore the corresponding graph. The challenge concerning the timeline was twofold: first, to redesign the component to improve the user experience and add extra features; second, to introduce an animation scheme with the possibility of exporting the resulting video.

Together with my mentors Cezary Bartosiak and Sébastien Heymann, I have proposed a new design for the timeline component featuring a sparkline chart in the background of the interval selection drawer (which is semi-transparent). This feature helps the user focus on particular moments in the evolution of the dynamic graph, like bursts of connections or changes in graph density or other simple graph metrics. The current metrics are the evolution of the number of nodes, the number of edges and the graph density. The sparkline chart was preferred to other chart solutions because it does not add too much visual clutter to the component and supports qualitative analysis. The interaction with the drawer remains largely the same as in the old timeline, to guarantee a smooth transition for the user.

To implement this feature we used the chart library JFreeChart (already bundled with Gephi), customizing its XYPlot into a sparkline chart by modifying its visual attributes. To display the sparkline, one needs to measure the properties of the graph at several instants within the global time frame in which the dynamic graph exists. This was a major challenge, since the original architecture did not allow the timeline component to access (and measure) the graph at particular instants; the solution was to introduce a slight modification to the DynamicGraph API to provide an object that gives us snapshots of the graph at given instants. Other challenges included the automatic selection/switching between real-number and time units in the timeline (depending on the nature of the graph in question) and the sampling granularity of the timeline.
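The sampling step can be sketched independently of Gephi and JFreeChart. The code below does not use the DynamicGraph API; `TimelineMetrics` and `TimedEdge` are hypothetical names, the graph is assumed to be undirected and simple, and a node is assumed to exist at an instant only if one of its edges does. It evaluates the density, 2m / (n(n-1)), at evenly spaced instants, producing the series a sparkline would plot.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that samples a graph metric along the timeline. */
class TimelineMetrics {

    /** An undirected edge that exists during the closed interval [start, end]. */
    static final class TimedEdge {
        final String source;
        final String target;
        final double start;
        final double end;

        TimedEdge(String source, String target, double start, double end) {
            this.source = source;
            this.target = target;
            this.start = start;
            this.end = end;
        }
    }

    /** Density of an undirected simple graph; 0 when there are fewer than two nodes. */
    static double density(int nodes, int edges) {
        return nodes < 2 ? 0.0 : (2.0 * edges) / ((double) nodes * (nodes - 1));
    }

    /** Samples the density at evenly spaced instants in [min, max]. */
    static double[] densitySeries(List<TimedEdge> edges, double min, double max, int samples) {
        double[] series = new double[samples];
        for (int i = 0; i < samples; i++) {
            double t = samples == 1 ? min : min + (max - min) * i / (samples - 1);
            Set<String> nodes = new HashSet<>();
            int edgeCount = 0;
            for (TimedEdge e : edges) {
                if (e.start <= t && t <= e.end) { // edge is present in this snapshot
                    edgeCount++;
                    nodes.add(e.source);
                    nodes.add(e.target);
                }
            }
            series[i] = density(nodes.size(), edgeCount);
        }
        return series;
    }
}
```

The number of samples plays the role of the sampling granularity mentioned above: more samples give a smoother sparkline at the cost of more snapshots to compute.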

Another breakthrough of this project was the introduction of the timeline animation. Once users have selected a time frame with the drawer, they can make it slide while the corresponding graph is displayed on the screen. Besides the technical aspects of the interaction between the timeline and the animation controller, there was also an effort to calibrate the animation (i.e., in terms of speed and frames) so that it would be comfortable and meaningful for the user.
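As a rough illustration of the sliding drawer, the sketch below computes the interval shown in each animation frame. It is not the animation controller's code: `TimelineAnimation` and its parameters are hypothetical, and the window is assumed to keep a fixed width and move in equal steps from the start to the end of the timeline.

```java
/** Hypothetical helper that computes the selected interval for each animation frame. */
class TimelineAnimation {

    /**
     * Slides a window of fixed width from min to max in frameCount equal steps.
     * Returns one {from, to} pair per frame.
     */
    static double[][] frames(double min, double max, double windowWidth, int frameCount) {
        if (frameCount < 2 || windowWidth <= 0 || windowWidth > max - min) {
            throw new IllegalArgumentException("invalid animation parameters");
        }
        // Distance the left edge of the window travels between consecutive frames.
        double step = (max - min - windowWidth) / (frameCount - 1);
        double[][] result = new double[frameCount][2];
        for (int i = 0; i < frameCount; i++) {
            double from = min + step * i;
            result[i][0] = from;
            result[i][1] = from + windowWidth;
        }
        return result;
    }
}
```

In this model, calibrating the animation amounts to choosing the frame count and the delay between frames: fewer frames make the graph jump, while more frames make the playback longer.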

As far as the UI is concerned, the component has gained a new “Reset” button next to the play button, which activates the timeline drawer and displays the chart. It also resets the drawer selection to the full interval when the timeline is active. The play button has regained its original function, that is, to control the animation of the timeline instead of activating the selection.

Finally, exporting the animation to a video format proved trickier than expected and couldn’t be finished as planned. There were several setbacks, beginning with the selection of a convenient library to write the movie container: it turns out that the de facto options available are not fully Java-based and need an encoder working in the background. The best alternative I found was Xuggler, which is based on ffmpeg. Obtaining screen captures of the graph was also a little tricky, so I exported SVG images of the graph for each frame, converted them to JPEG and then encoded them through Xuggler into a video format. As one might expect, this solution is not very efficient in terms of time, so Mathieu Bastian and my mentors suggested that I wait for the new Visualization API, whose features would make this process simpler.

In addition to ongoing bugfixes and minor improvements to the timeline and the animation, the movie export remains the next big step to close this project. If you have questions or suggestions, please do not hesitate to share them! The new timeline will be available in the next release of Gephi.

DB
