Announcement Community ~ gsoc
It’s a great news, Gephi has been accepted again for the Google Summer of Code for the 5th year! The program is the best way for students around the world to start contributing to an open-source project. Since 2009, each edition is a great success and dramatically boosted Gephi’s project development.
What is Gephi?
Networks are everywhere: email systems, financial transaction systems and gene-protein interaction networks are just a few examples. Gephi began as a university student project four years ago and has quickly become an open source software leader in the visualization and analysis of large networks. It is an important contribution to the ecosystem of tools used by researchers and big data analysts to explore and extract value from the deluge of relational data and disseminate a better understanding for people to think about a “connected” world.
Gephi is a “Photoshop” for graphs: designed to make data navigation and manipulation easy, it covers the entire process from data importing to aesthetics refinements and communication. Users interact with the visualization and manipulate structures, shapes and colors to reveal the properties of complex and messy data. The goal is to help data analysts make hypotheses and intuitively discover patterns or errors in large data collections.
Gephi’s project aims at providing the perfect tool to visualize and analyze networks. We focus on usability, performance and modularity:
- Usability: Easy to install, an UI without scripts and real-time manipulation.
- Performance: Visualization engine and data structures are built scalable. Supporting always-larger graphs is an endless challenge!
- Modularity: Extensible software architecture, built on top of Netbeans Platform. Add plug-ins with ease.
Learn more about Gephi, watch Introducing Gephi 0.7, download and try it by following Quick Start Tutorial.
Gephi’s project is young, the growing community is composed of engineers and scientists involved in network science, datavis and complex networks.
List of ideas
List of ideas are availabe on our wiki. They cover various skills and level of difficulties:
* Completing Legend module – Complete the Legend module, which was started in last year GSoC.
* GraphStore benchmark and tuning – Optimize and tune GraphStore based on a serie of new well-defined benchmarks.
Please also propose your ideas on the forum. They will be considered and discussed by the community. Have a look at our long-term Roadmap.
Students, join the network
Students, apply now for Gephi proposals. Join us on the forum and fill in the questionnaire. Be careful, deadline for submitting proposals is May 3 (timeline)!
Hélder Suzuki, student for Gephi in 2009 and now software engineer at Google, wrote:
At Gephi students will have the opportunity to produce high impact work on a rapidly growing area and be noted for it.
View our previous Google Summer of Code projects here and read former students interviews.
Follow Gephi on Twitter
Announcement Design ~ 0.9 architecture code roadmap
This is the first article about the future Gephi 0.9 version. Our objective is to prepare the ground for a future 1.0 release and focus on solving some of the most difficult problems. It all starts with the core of Gephi and we’re giving today a preview of the upcoming changes in that area. In fact, we’re rewriting the core modules from scratch to improve performance, stability and add new features. The core modules represent and store the graph and attributes in memory so it’s available to the rest of the application. Rewriting Gephi’s core is like replacing the engine of a truck and involves adapting a lot of interconnected pieces. Gephi’s current graph structure engine was designed in 2009 and didn’t change much in multiple releases. Although it’s working, it doesn’t have the level of quality we want for Gephi 1.0 and needs to be overhauled. The aim is to complete the new implementation and integrate it in the 0.9 version.
In November 2012, we started to develop a completely new in-memory graph structure implementation for Gephi based on what we’ve learnt over the years and our desire to design a solution that will last. The project code-name is GraphStore and we focus on four main things:
- Performance: The graph structure is so important to the rest of the application that is has to be fast and memory efficient.
- Stability: The new code will be the most heavily unit-tested in the history of Gephi.
- Simplicity: The Graph API should be documented and easy to use for developers.
- Openness: If possible, we want GraphStore to be used in other projects and keep the code free of Gephi-specific concepts.
Gephi is known to use a large amount of memory, especially for very large networks. We want to challenge ourselves and tackle this issue by redesigning the way graphs are encoded and stored. Besides memory usage, we carefully analyzed possible solutions to improve read/write performance and optimize the throughput. Stability and simplicity are like food and shelter, and whatever we try to do at Gephi should be simple to use and stable. As we’re going towards a 1.0 version, we’re putting more and more efforts to testing and code quality.
Since November 2012, we have been working on GraphStore separately from Gephi’s codebase and will start the integration fairly soon. The Graph API is very similar to the existing API. However, it isn’t entirely compatible and several core things changed like attributes, views or dynamic networks and will require a lot of work in some modules. On the other hand, because the GraphStore code is decoupled, it could be leveraged in other projects. For instance, it could serve as a Blueprints implementation as an alternative to TinkerGraph.
A graph (also called network) is a pair of a set of nodes and a set of edges. Edges can be undirected, or directed if the direction of the relation matters. Edges may also have weights to represent a value attached to the edges, like the strength of a connection or the flow capacity. Edges may also point to the same node (i.e. self-loops). Gephi currently supports these features, but they are not sufficient to describe the variety of problems graphs can be helpful with. Multigraphs permit several relationships between nodes and is for instance commonly used to represent RDF graphs. Multigraphs with properties (i.e. ability to attach any property to nodes and edges) have recently become the standard representation for graph databases.
Multigraph with labeled edges (source: TinkerPop)
The next version of Gephi will support multigraphs and therefore allow multiple edges between nodes to be imported. The rollout will be done in two phases. The first phase is to allow this new type of graph to be imported, filtered and exported. We will update the importers and add new options to support these graphs. The second phase is to update the visualization and the way multiple edges between nodes look like.
Since the 0.7 version released in 2009, Gephi has supported hierarchical graphs. Hierarchical graphs let the user group or ungroup nodes so it forms a tree. Nodes which contain other nodes are named meta-nodes and edges are collapsed into meta-edges. Groups obtained from clustering algorithm (e.g. modularity) could also easily be collapsed into meta-nodes in order to study the network at a higher level. We initially recognized the potential of this idea for network analysis and developed a hierarchy-enabled data structure. However, we realized we didn’t completely fulfill the vision by not providing all the tools to fully explore and manage hierarchical networks. Although the data structure allows it, the software still lacks many features to really make hierarchical networks explorable.
Recently, we are more focused on networks over time and plan to continue to do so. In the past years, users have shown steady and continuous interest in dynamic networks and we haven’t really seen a strong interest in hierarchical networks. Therefore, we propose to remove this feature from next releases. On the developer side, cutting this feature will greatly simplify the code and improve performance.
Networks that change over time are some of the most interesting to visualize and analyze. We have heavily invested in supporting this type of network, for instance by developing the Timeline component. However, dynamic graph support was added after the current graph structure implementation was conceived and therefore remains suboptimal and difficult to scale. Now that we have enough hindsight, we can rethink how this should be done and make it simpler.
One pain point is the way we decided to represent the time. Essentially, there are two ways to represent time for a particular node in a graph: timestamps or intervals. Timestamps are a list of points where the particular nodes exist and intervals have a beginning and an end. For multiple reasons, we thought intervals would be easier to manipulate and more efficient than a (possibly very large) set of timestamps. By talking to our users, we found that intervals are rarely used in real-world data. On the code side, we also found that it makes things much more complex and not that efficient at the end.
In future versions, we’ll remove support for intervals and add timestamps instead. We considered supporting both intervals and timestamps but decided that it would add too much complexity and confusion.
Graph structure internals
Graph structures design is an interesting problem to solve. The objective is quite simple, yet challenging: how to best represent an interconnected graph so it’s fast to query and compact in space? Also, how to keep it simple and serve a large number of features at the same time?
Our goal is to develop a thread-safe, in-memory graph structure implementation in Java suitable for real-time analysis. You may ask how this differs from a graph database or a distributed graph analysis package. In a few words, one can say the requirements are quite different.
Graph databases like Neo4j, OrientDB or Titan store the graph on local disk or in a cluster and are optimized for large graphs and large number of concurrent users. Typically, the networks are much larger than what can fit in memory and these databases mostly focus on answering traversal queries. In the environment where graph databases operate most of the needs can be converted in some sort of traversal query (e.g. friends of X, tweets of Y). Traversal queries are also the reason why graph databases scale to billions of nodes. Indeed, for each traversal, only a subset of the graph is accessed. This is quite different from Gephi, which by its nature of being an analysis software needs to access the complete graph. For instance, when a layout is running Gephi needs to read the X,Y position of each node as quickly as possible. Although reading from the disk can be very quick as well (e.g. GraphChi), it’s limited to sequential access and things become more complex that way.
Because of the real-time requirements, we want to keep our graph data in memory accessible at all time. However, we want to make it easy to connect to external data sources, and graph databases in particular.
In computer science, overhead is any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal.
GraphStore heavily relies on Java primitives, arrays and efficient collections library like fastutil. We are reducing overhead by simply avoiding using too many Java objects, which are very costly. Instead of using maps, trees or lists, Nodes and Edges are stored in large arrays which can be dynamically resized in blocks. For instance, iterating over the graph should be extremely fast because the CPU caches array blocks. This may sounds obvious but performance optimizations are tricky in Java because of the JVM and the uncertainty of what makes a difference and what doesn’t. In his “Effective Java” book, Joshua Bloch writes “Don’t guess, measure” and that’s still true today. For our project, we rely on well-defined micro-benchmarks to see where the bottlenecks are and how to make our data-structure more cache-friendly and more compact in memory. When the graph contains millions of edges, every byte saved per edge can make a large difference at the end.
In terms of speed, we focused on optimizing the most common operations, which are iterating over all the elements and consult nodes’ neighbors. Typically, a layout algorithm needs to read the neighbors of every node at each iteration. Neighbors can’t simply be an unsorted list because of the removal complexity: to remove a node, you need to know where it is. The current Gephi graph structure uses a binary tree to store the node’s neighbors. Although the complexity is logarithmic, every node in the tree takes extra memory and logarithmic complexity is still suboptimal. After isolating the problem in a benchmark, we found that using a double linked-list is the best solution for our requirements and achieves a O(1) complexity, as it fulfills both a quick iteration and quick update. Here is a snapshot of the solution:
GraphStore's edge storage schema (simplified)
Every edge has 4 integer pointers to the next in/out predecessor and successors and a separate dictionary would help to find the right edge based on the source and destination pointers. Each node has a pointer to the first edge in the linked list (i.e the head). Node ids are integers (32 bits) so one can easily create a long->Edge dictionary by encoding the source and destination node into a single long number (64 bits). The diagram intentionally leaves out the multigraph support for simplicity. In reality, nodes can have multiple head pointers, one for each edge type. Each edge type is represented by a integer index.
Views are one of the most useful aspects of Gephi’s graph structure and are mainly used behind the scenes in the Filter module. A view is a graph subset (i.e. a subgraph) which remains connected to the main structure, so if a node is removed from the graph, it’s removed from the views as well. For instance, when users create a ‘Degree Filter’, Gephi creates a view and removes all the nodes which don’t fulfill the degree threshold. Multiple views can co-exist at any time in the graph structure. In the current graph structure, a node tree complete copy is done for every view and we found that this can be very inefficient.
In the new version, the way views are implemented is very different and should yield to better performance. Instead of doing a copy of the nodes, we maintain bit-vectors for nodes and edges. Because these elements are stored in large arrays with a unique identifier, it’s easy to create and maintain a bit-vector. When developers obtain the ‘Graph’ object for a particular view, the bit-vectors are used behind the scenes to adapt iterators and accessors. This solution should make filtering for large graphs much quicker. One drawback is that whereas the current implementation copies and then trims the view, GraphStore work with bit-vectors but continues to access the complete graph. In other words, if the view represents only 1% of the original graph, it still needs to iterate over the 100% to find which elements are the 1%. Even though this sounds bad, our benchmarks show it’s a very fast operation and we win overall because of the reduced overhead of duplicating the graph. Moreover, we can introduce some caching later to optimize this further.
When you’re using the Partition module in Gephi, you’re manipulating some sort of inverted index. Nodes and edges have properties like ‘gender’, ‘age’ or ‘country’, and these properties are contained within the nodes and edges objects. An index is a simple data structure which allows to retrieve the list of elements for a particular value. For instance, the partition module needs to know what is the number of ‘male’ or ‘female’ nodes for the ‘gender’ column. When the column is a number like ‘age’, it also needs to know what is the maximum and minimum value. Unlike the Ranking module and its auto-apply feature, the Partition module is not refreshed in real-time and therefore difficult to use when the graph is changing a lot. We have decided to invest in this feature for the future release and are building a real column inverted index in the graph structure. The index will simply keep track of which values exist for each column and which elements are holding this value. The index will be updated in real-time as elements are added, removed or updated.
The ability to quickly retrieve elements and counts based on specific values will be very useful in many different modules like Filters, Partition or Data Laboratory. New APIs will be added for developers to use the newly created index interface. As we’re working on attributes storage and manipulation, we’ll also merge the Attributes and Graph API because they are so interconnected that it doesn’t really make sense to have them separate. The interfaces that developers are familiar with like Table or Columns will remain the same.
In software programming, events are a common way to inform other modules that something changed. In Gephi, we also use events to convey graph updates events to inform other modules about updated nodes or edges. In the new GraphStore, we’ll stop using events to transport graph modifications because of the large overhead due to the creation of event objects. Indeed, when 10K nodes are added to the graph, the existing structure literally creates 10K event objects and puts them in a queue. Although the event queue is compressing objects of the same type, the overhead to create, queue, send and destroy large amount of small Java objects is too large.
Instead of a push model (i.e. the emitter is pushing updates), we want to rather promote a pull model (i.e. the listener pull updates from time to time) for future releases. A similar system is already in place to link the graph and the visualization module and it has been working without a glitch. We’ll develop the tools to easily calculate graph differences between a listener module and the graph structure. By removing the bottleneck, write performance should greatly improve.
As said earlier, we’ll add timestamps support to represent dynamic networks. Instead of using a time interval, a timestamp array will be associated with nodes and edges. For element (node/edge) visibility, each timestamp represents the presence of the element at that time. For example if a network snapshot is collected every month for a year, each node will have up to 12 different timestamps. The timestamp itself is a real number and can therefore represents an epoch time but also any other value in a different context. For a dynamic attribute, the time+value is simply represented as a list of (time, value) pairs.
To support the timeline and dynamic networks algorithm, we’re developing an inverted index for timestamps so we can make time filtering very quick. One good thing about intervals is that it’s very easy to know if two intervals overlaps with each other. With a flat list of timestamps, one can’t avoid to go through the entire list. The index will essentially map timestamps to the nodes and edges elements in the graph and therefore solves this issue. The Interval tree implementation which we are currently using to store intervals is based on a binary tree and is very costly in memory because of all the Java objects overhead. Using simple arrays should reduce overhead and improve performance for large dynamic networks. When computing a dynamic network algorithm (ex: Clustering coefficient over time), we’re using a sliding window over the graph so the ability to quickly filter is critical as it impacts how fast the graph refreshes.
Saving and loading the graph structure into into/from a file (or a stream) is another critical feature. When a user saves a project in Gephi, the graph data structure is serialized in XML and compressed into a .gephi file. If you worked with project files in Gephi, you may have experienced corrupted files issues or errors when loading a file. We’ve done our best to fix these problems but some still remain. We’re rethinking how this should be done in GraphStore and are making a call to rewrite the code from scratch. Our approach will rely on a lot of unit tests to make sure the code is stable so we don’t repeat the same issues in future versions. Please note that this concerns the .gephi files only and existing importers (e.g. GEXF, GraphML) will remain the same.
Concerning the GraphStore serialization, we’re abandoning XML in favor of pure byte arrays. That should yield to better performance and reduced project file size. We’ll create a custom reader for previous Gephi versions so you can still open your existing projects. Other modules like Filters or Preview will continue to use XML as it’s working just fine.
This is the first post about the Gephi 0.9 version and more will come soon. We’re excited about the current developments and hope to hear from you. Please join the gephi-dev mailing list to learn more about ongoing projects and contribute. We need your ideas!
Follow us on Twitter!
Announcement Community Website ~ marketplace plugin
Computer Science & Engineering Graduate from the Compiègne University of Technology, Guillaume Ceccarelli has been Gephi’s SysAdmin since 2008 and Web Developer since 2012. When he’s not by day working in the financial world, he freelances and stays close to ongoing entrepreneurship ventures.
Four weeks ago, we launched the Gephi Marketplace.
Haven’t taken a look yet? Didn’t know it existed? Take a few minutes to check it out now! It’s right here, waiting for you.
Why the Gephi marketplace?
For a long time, we felt like we lacked a central point of access when it came to what everyone was doing to improve or build upon Gephi. On one hand, our forums allowed for discussions to happen but they didn’t really enable anyone to share their work easily or to find out about one another. On the other hand, the wiki as well as our main site, served more as official portals to spread the word or to explore knowledge around Gephi, but they didn’t seem like a good fit to showcase the work our community was doing.
A little while ago, we decided to fix that.
The spirit behind the marketplace is simple. When we started, our idea was:
- To provide Gephi users with easy means to find and learn more about what the community was building for and around Gephi
- To give creators and people working with Gephi a way to publish their work and make them known to the entire community, without the overhead of a complex and lengthy process
This meant creating the central point for Gephi Plugins and what we call Gephi Services, which eventually became the Gephi Marketplace.
Before the marketplace existed, plugins were submitted and downloaded through the old plugin center, which required authors and the Gephi staff to go through a tedious process, creating a lot of work and unnecessary frustration for everyone. In addition, the system didn’t allow authors to update their own plugin pages, or to submit a different plugin for different versions of the Gephi application, a feature which was long requested before we finally became in a position to deliver it to you.
A new world of plugins
Now, every plugin download, even from the Gephi Application itself go through the marketplace! It means that as soon as a plugin goes live on our new platform, it’s directly available to every Gephi user around the world. Even if you haven’t made a plugin yourself, chances are you’ve already used the marketplace without realizing it!
Speaking of around the world, here’s a geographic breakdown of Gephi plugin downloads within the past 30 days:
Isn’t it impressive? 1169 plugin downloads and every continent is represented! (ok, not every single one. Props to the first Gephi user who’ll download a plugin from Antartica!)
It makes us extremely proud to help and foster such a vibrant community of Gephi users around the world. Remember that this map is for the last 30 days alone!
Today, we have 38 plugins listed on the marketplace, and we’ve made sure you can post your creations as easily as possible. It takes no more than 5 minutes to first fill in the plugin upload form, and we usually get back to you within 24 hours, so you can make your work available to thousands in record time!
A world of services
As briefly mentioned before, we’re also listing Gephi Services on the marketplace.
Over the years, we’ve noticed that a community of professionals and enthusiasts started providing services to the community. We’ve seen people doing trainings on Gephi, others integrating Gephi into data analysis workflows, or even create studies using Gephi as their tool of choice. We thought it was about time these fine folks got easier to find!
Every service provided by the community, for free or with a fee, is welcome to be listed. We already have a few, and to be honest we would love to see more of them!
Listing the services you provide on the Gephi marketplace is easy, and like with listing plugins it’s completely free, regardless of what you choose to charge for your work. We’re not here to collect fees. We just want to help other people know about your work, and to help you easily be found and easily be reached.
Following the same simple form submission process you can use for plugins, you give us all the info on what you’ve got for the community, you click ‘upload’, we get back to you, and then you have yourself a listing!… Which you can then customize to your liking, and for which you can include every single bit of information you want, and even a great copy! So not only can you let everyone know about what it is that you do, but also how truly wonderful it is to work with you and your expertise. So?
And now back to you.
Any question? Feel free to comment below! We’ll be glad to help you get started. And if you’d like to be kept in the know of what’s newly getting posted on the Marketplace, the RSS feed is this way.
In the meantime, thank you for your continued trust and your use of Gephi as your Graph Exploration and Manipulation tool of choice! It is our humble pleasure to serve you all.
See you soon on the marketplace!