Introduction
Secrets management is a difficult problem. Every organization has a vast sea of secrets to deal with, many of which need to be used collaboratively. With so many hands touching so many secrets, some will inevitably be exposed, either intentionally or unintentionally. No organization is immune: this issue affects companies and groups of all sizes and forms. Unfortunately, data repositories containing secrets can be virtual treasure chambers for malicious actors, and the impact of their exposure can be catastrophic.
Where does it happen?
Any places where people store large quantities of data are potentially at risk for improperly stored credentials. Examples of these are GitHub and Bitbucket repositories, SharePoint servers, open file shares, Pastebin posts, and other similar storage locations. This can also include log file aggregators such as Splunk if sensitive data is being inadvertently logged. In this post, we’ll focus on GitHub, but the general concepts discussed here apply to any storage repository to which a malicious actor may be able to gain access.
Some may counter that “we only have internal instances of these repositories”. While this lessens the potential attack surface in some respects, it’s not a guarantee of safety. Internally hosted instances are often at a much higher risk due to insider threats and because authorized users become more complacent when they believe their data is less exposed on internal systems.
Why does it happen?
Secrets end up in data repositories for many reasons. The first is simply apathy. Some companies don’t prioritize protecting credentials or don’t want to dedicate resources to ensuring that they are protected properly. As a result, a culture develops where data repositories are used as a catch-all storage environment without regard for safety or the implications of what is being stored or exposed there. Often, due to pressure to complete projects, developers are encouraged to “just get things done” and as a result, sensitive credentials are included in scripts and software to save time.
A related reason is a lack of proper education around protecting secrets. Developers or project managers are more focused on completing their tasks and not considering that storing secrets in code or configuration files may expose them. They may not be fully aware of who has access to repositories and in that way, secrets can become exposed.
Similarly, accidental commits of secrets frequently happen. Because codebases can be complicated, files containing secrets meant to be only used locally can be accidentally committed to a repository. These might be configuration files, local versions of code being tested that have hard-coded secrets, or local test versions not meant to be pushed to the codebase. As a result, the secrets are committed to the data repository and sometimes are accidentally overlooked or, if caught, may not be fully removed from the commit history. This often happens when a developer is working on personal code projects with a computer they are also using for work purposes. As a result, work files containing secrets are accidentally posted to a public source code repository, which can have devastating effects if not caught.
Finally, we have scenarios involving malicious intent. Insider threats or malicious actors may use data repositories, especially public ones, to intentionally expose secrets or to exfiltrate sensitive data from an organization. In the case of accidental commits to personal code repositories, it can be difficult to prove malicious intent and as such, can provide plausible deniability.
Why is it bad?
It may seem obvious, but exposed secrets can lead to terrible consequences for any organization. One good example is the recent SolarWinds breach. Although it’s been shown that there were multiple vectors of compromise, one which was suggested early on was exposed credentials in GitHub. While it is unproven that this was actually involved in the breach, the possibility is certainly there. If an attacker can easily get the keys to the kingdom, why would they bother breaking down the front door? There’s no point in brute forcing accounts or trying high-tech exploits when you can just log in using valid credentials or keys like a normal user.
Some might say that leaked secrets aren’t always a big deal. Certainly, a minor internal service account isn’t likely to directly lead to a major breach. However, exposed information about an organization’s internal networks, naming schemes, username formats, email formats, and other seemingly inconsequential information can be combined to provide a much clearer “big picture” for a malicious actor. Even if the information being exposed isn’t “secret” per se, it can be used in concert with other leaked information to build a better attack against the organization.
So what can we do about it?
Thankfully, there are some options to prevent secret leakage or to at least deal with leaked secrets once exposed. Some automated solutions exist to check code commits for secrets. GitHub itself provides a tool for this type of protection. There are also some open-source options for scanning repositories manually, such as GitRob and TruffleHog.
Unfortunately, automated solutions aren’t magic bullets. Automated detection-on-commit tools can’t detect things like rogue developers putting secrets into personal repositories or obfuscating sensitive information for exfiltration. Manual detection tools have their own problems. Some tools require the user to clone the repository first in order to scan it, something which can be unwieldy when trying to search for relevant credentials across an entire organization. Others require specific repositories or organizations to be specified. This works well when the desired focus is known, but it may be necessary to look for secrets across the entire data repository system and not just in select repositories or organizations.
Both types of tools also struggle with detecting secrets in the first place. Like writing antivirus signatures, writing code that detects every secret format is an impossible task. Secrets and related code take many forms, and styles differ between developers. Trying to normalize them for detection is extremely difficult. Many of these tools look for “high-entropy” strings as a starting point for finding secrets. This can be useful, but as we know, not all secrets fall into that category, “solarwinds123” being a good example.
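To make the “high-entropy” idea concrete, here is a minimal sketch of how such a check might work. It is not how any particular tool implements it. The function names and the 4.0 threshold are our own illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_high_entropy(text: str, threshold: float = 4.0) -> bool:
    # The threshold is an illustrative assumption, not a value taken
    # from any specific scanner.
    return shannon_entropy(text) >= threshold


if __name__ == "__main__":
    for candidate in ["solarwinds123", "zX9#kQ2$vLm8@Tq4Wb7!"]:
        print(candidate, round(shannon_entropy(candidate), 2),
              looks_high_entropy(candidate))
```

With this threshold, “solarwinds123” scores roughly 3.5 bits per character and is not flagged, while the random-looking 20-character string scores about 4.3 and is. This is exactly the gap described above: a weak, human-chosen password slips past an entropy-only check.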
The proof is in the pudding: we still see secrets in public and private GitHub systems, even today, despite these tools existing. It’s getting better, but still not good enough. This is why, for a pentester or defender, having the knowledge to manually search for secrets is important.
How can we organize a search for secrets?
Searching for credentials is very similar to doing threat hunting, for those readers who work on the blue-team side of the fence. If interested in threat hunting, I highly recommend Chris Sanders’ “Practical Threat Hunting” course. I’ve used some of those threat hunting concepts as a basis in this article for understanding manual searching for secrets, specifically those of “Attack Based Hunting” as described in the course, but modified to be used here.
In Chris’ course, he discusses the idea of generating a hypothesis to focus the search. To do this with secrets hunting, we determine a scenario we would like to test: i.e. “I believe X company has credentials in their repositories and those credentials are likely to be in a username/password format.” The next step is to determine what a manifestation of that scenario would look like. This refers to specifically how those secrets would be represented in the code: which keywords might be used, what the code format might look like. By doing this, we can start the search in a constrained manner in order to more efficiently deal with a large number of results likely to be returned.
Searching in a constrained manner isn’t the only approach, however. If the goal isn’t to focus on a specific organization or target, and the desire is to look for secrets regardless of the source or use, then generic keywords are fine. However, for time-boxed assessments as opposed to searching for secrets for the good (or perhaps detriment) of humanity, a focused search is probably more appropriate.
Finally, one last concept which is used in threat hunting is the idea of pivoting. In threat hunting, pivoting is when we switch from one data source to another based on common field data such as a hostname or ID which is present in records in disparate data sources. For secret hunting, we can “pivot” based on found information. Let’s say we search for a company name and the terms “username” and “password”. In some of the results, we may find internal hostnames specific to the target. We can then search on those hostnames to find records that may not explicitly mention the target’s company name but have relevant sensitive information. The common link between these records is the internal hostname which was found during the initial search. These sorts of pivots can lead to important discoveries.
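As a rough sketch of the pivoting step, the snippet below pulls hostnames out of text returned by an initial search so they can be used as the next round of search terms. The domain suffix `corp.example.com` and the sample snippets are hypothetical placeholders, and the function name is our own.

```python
import re


def extract_pivot_hosts(snippets, domain_suffix):
    """Collect unique hostnames ending in domain_suffix from result snippets."""
    pattern = re.compile(
        r"\b((?:[a-z0-9-]+\.)+" + re.escape(domain_suffix) + r")\b",
        re.IGNORECASE,
    )
    hosts = set()
    for snippet in snippets:
        for match in pattern.findall(snippet):
            hosts.add(match.lower())
    return sorted(hosts)


if __name__ == "__main__":
    # Hypothetical snippets standing in for initial search results.
    results = [
        "url=jdbc:mysql://db01.corp.example.com:3306/app",
        "ssh deploy@build.corp.example.com",
        "see www.example.org for docs",
    ]
    for host in extract_pivot_hosts(results, "corp.example.com"):
        print(host)
```

Each hostname printed here becomes a candidate search term of its own, which may surface files that never mention the company by name.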
What are some manual practical approaches for searching?
While there may be benefits to searching in a general way for secrets in data repositories, it’s much more efficient to focus the search on a specific target. When starting a search, using common OSINT techniques to gather information is a useful first step. One option is to focus on gathering information about what sort of technology is used by the target. This can help narrow the search in terms of which keywords and code structures to look for. Any other target information about internal hostnames, email addresses, IP ranges, domain names, and naming conventions can help improve the results of later searches.
One common place to start is subdomain enumeration, followed by searching on the subdomains found. Subdomains can be enumerated with tools such as TheHarvester or websites like shodan.io or dnsdumpster.com.
More in-depth OSINT research can involve identifying developers at the target organization in question and finding their public GitHub repositories, either professional or personal. If personal repositories are found, look for connections between them and the target organization, such as common users committing between repositories and projects. Often, developers at a target organization will mix commits between their professional and personal accounts, allowing connections to be identified. This type of searching is often only warranted if previous searches have been unfruitful, due to the amount of time investment required.
Internally hosted instances of GitHub or other data repositories are likely to hold more unprotected secrets, but public GitHub has plenty of secrets to find as well.
Be sure to take into account the age of results found, especially in publicly hosted GitHub or other repositories. Newer commits of code are much more likely to be relevant or still valid. However, don’t necessarily throw out old ones as people who post secrets in GitHub often don’t rotate them. They may still be valid even years later.
Another factor to consider is that there will be a lot of unrelated, misleading, or spurious data. This noise grows quickly as the profile of the target grows. For example, if you have a small, relatively unknown company as a target, results with target-specific keywords are likely to be valid. However, if you’re searching for a large corporation, you’re much more likely to get unrelated or distracting results. Frequently, people doing research on the target’s websites collect data and put it into GitHub, which can cause a lot of unnecessary results. There might also be applications from external developers trying to interface with the target’s website APIs, which can cause confusion. Even if valid target repositories are found, some results contain placeholder values like “password = ‘password’”, which can look like valid findings and require an experienced eye to filter out. This will come with practice. It’s important to note, however, that sometimes the password is actually “password”. 😉
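One way to speed up this kind of triage is to flag values that look like placeholders rather than discard them, since the password really can be “password”. The sketch below is a simple illustration. The regular expression, the placeholder list, and the label names are all our own assumptions and would need tuning for real results.

```python
import re

# Matches assignments such as: password = 'value', db_password: "value"
ASSIGNMENT = re.compile(
    r"""(password|passwd|pwd)\s*[:=]\s*['"]([^'"]*)['"]""",
    re.IGNORECASE,
)

# Illustrative placeholder values; extend based on what you actually see.
PLACEHOLDERS = {"", "password", "changeme", "xxxx", "your_password_here", "<password>"}


def triage(line: str) -> str:
    """Label a line as 'no-match', 'likely-placeholder', or 'review'."""
    match = ASSIGNMENT.search(line)
    if match is None:
        return "no-match"
    value = match.group(2)
    if value.lower() in PLACEHOLDERS:
        # Deprioritize rather than drop: it may still be a real credential.
        return "likely-placeholder"
    return "review"


if __name__ == "__main__":
    for sample in ["password = 'password'", 'db_password = "S3cure!Pass"']:
        print(sample, "->", triage(sample))
```

Lines labeled “likely-placeholder” go to the bottom of the pile instead of being thrown out.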
The key to filtering out bad data is to match keywords that are likely to get the data you want (e.g. private keys, passwords, etc.) with target-differentiating keywords. These should, if possible, concretely identify results that are specific to the target and no one else. For example, we might combine the keyword “passwd” with a unique hostname we’ve associated with the target through OSINT research. This will help remove the “chaff” in the results and better focus our efforts. Be aware that this may focus the search too much at times; if searches prove unfruitful, back off and broaden the terms somewhat.
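The pairing described above is easy to automate. The following sketch generates one query string for every combination of a secret-related keyword and a target-specific keyword. The keywords and hostnames are hypothetical examples, and the quoted-phrase format is an assumption: adjust it to whatever syntax your search interface accepts.

```python
from itertools import product


def build_queries(secret_keywords, target_keywords):
    """Pair every secret-related keyword with every target-specific keyword."""
    return [
        f'"{target}" "{secret}"'
        for secret, target in product(secret_keywords, target_keywords)
    ]


if __name__ == "__main__":
    secrets = ["passwd", "BEGIN RSA PRIVATE KEY"]
    # Hypothetical identifiers gathered during OSINT research.
    targets = ["db01.corp.example.com", "corp.example.com"]
    for query in build_queries(secrets, targets):
        print(query)
```

If the combined queries return nothing, dropping the most specific target keyword first is a reasonable way to widen the search.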
Additionally, if you find results that look like they should hold secrets and don’t, or which may have held them in the past, check the commit history. Often, developers will just commit a change to the code and deem the exposed secret to be fixed. They don’t consider the fact that the old password is still in the history entries. Even if they do realize the danger, removing code commits from history can be a time-consuming and difficult task. Some developers or users feel that the time needed to spend to fix the issue isn’t worth the risk caused by the secret exposure. Also, often the secrets are not rotated, but only the code commit is “fixed”. Thus, the secrets in history are still valid and available for use with some minor effort.
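When a repository has been cloned, its history can be dumped as patch text (for example with `git log -p`) and scanned for lines that were removed. The sketch below works on that patch text only. The keyword list and function name are illustrative assumptions, and the sample patch is made up.

```python
import re

# Illustrative hints that a line may contain a secret.
SECRET_HINT = re.compile(r"password|passwd|secret|api[_-]?key|token", re.IGNORECASE)


def removed_secret_lines(patch_text: str):
    """Return removed lines from unified-diff text that mention a secret hint."""
    hits = []
    for line in patch_text.splitlines():
        # Removed lines start with '-'. Lines starting with '---' are treated
        # as old-file headers and skipped, so a removed line whose own content
        # begins with '--' would be missed by this simple check.
        if line.startswith("-") and not line.startswith("---"):
            if SECRET_HINT.search(line):
                hits.append(line[1:].strip())
    return hits


if __name__ == "__main__":
    sample_patch = "\n".join([
        "--- a/config.ini",
        "+++ b/config.ini",
        "@@ -1,2 +1,2 @@",
        " host = db01",
        "-password = hunter2",
        "+password = ${DB_PASSWORD}",
    ])
    print(removed_secret_lines(sample_patch))
```

Anything this returns was deleted from the current code but is still sitting in history, and if the secret was never rotated it may still work.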
Now that the higher-level aspects of searching have been covered, let’s discuss the hands-on side of searching. This is more of an art than a science, so some experience and intuition certainly help. One useful approach is GitHub “dorking”, a method of searching GitHub using keywords to narrow down search results effectively. An excellent list of popular GitHub dorking keywords is: https://github.com/techgaun/github-dorks.
Picking specific file types which are likely to contain secrets can be quite effective. These might be configuration files (*.ini, *.cfg) or files specifically intended to contain secrets, like SSH keys (*.pem). This could also include system files such as Ansible playbooks, bash scripts (*.sh), and bash history files (.bash_history). Search results can be filtered by choosing the desired file type on the left side of the page.
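The same file-type filtering can be applied offline to any list of file paths, such as a listing from a cloned repository. This is a small sketch using the patterns mentioned above. The function name and sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# File name patterns taken from the file types discussed in this post.
INTERESTING_PATTERNS = ["*.ini", "*.cfg", "*.pem", "*.sh", ".bash_history"]


def interesting_files(paths):
    """Return the paths whose file name matches one of the patterns."""
    selected = []
    for path in paths:
        # Compare on the lowercased file name so 'Prod.PEM' still matches.
        name = PurePosixPath(path).name.lower()
        if any(fnmatchcase(name, pattern) for pattern in INTERESTING_PATTERNS):
            selected.append(path)
    return selected


if __name__ == "__main__":
    listing = [
        "app/settings.ini",
        "deploy/run.sh",
        "README.md",
        "home/dev/.bash_history",
        "keys/Prod.PEM",
    ]
    for path in interesting_files(listing):
        print(path)
```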
One excellent approach is to anticipate keywords surrounding secrets in repository files and search for those. This may return many false positives and thus require more time, but can be extremely effective in acquiring valuable secrets. This might include searching for keywords like “JDBC”, “username”, “password”, “IDENTIFIED BY” (for credentials in SQL commands), or “expect” (which is used when passing passwords via command line to an SSH command).
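For files already in hand, the same keywords can drive a quick local scan. The sketch below reports each line that contains one of the keywords listed above. The function name and sample text are our own, and as noted, expect false positives (“expect” also matches “unexpected”, for instance).

```python
import re

# Keywords from the discussion above, matched case-insensitively.
KEYWORDS = ["jdbc", "username", "password", "identified by", "expect"]
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def find_keyword_lines(text: str):
    """Return (line number, first keyword matched, stripped line) tuples."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = KEYWORD_RE.search(line)
        if match:
            hits.append((lineno, match.group(0).lower(), line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up file content for illustration.
    sample = (
        "CREATE USER app IDENTIFIED BY 'tiger';\n"
        "# nothing here\n"
        "url = jdbc:oracle:thin:@db01:1521/orcl\n"
    )
    for hit in find_keyword_lines(sample):
        print(hit)
```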
There are too many specific techniques to list here comprehensively. This article is meant to show higher-level approaches and to guide the reader toward understanding the results they see and developing a useful strategy on the fly, for a better return on the time invested in searching.
Next, we will look at practical examples of searching to illustrate these concepts in the real world in part 2…